In the last post, the data from the on-time flight database was
loaded in a column-orientated storage engine. Now the numbers can
be crunched.
The original goal of this exercise was to find the flight from
Los Angeles International Airport, LAX, to Dallas Fort Worth
International Airport, DFW, that was the most likely to arrive
on-time.
The data is 'opportunity rich' in that there is a lot information
in there. It is easy to start wondering about the various nuggets
of information in there. Are their certain aircraft (tail
numbers) that are routinely bad performers? Are some days of the
week better than others? Do national holidays have an effect on
the on-time performance? If you are delayed, is there a 'regular
amount' of delay? Does early departure make for an early arrival?
Can the flight crew make up for a late departure? How much time
is usually spend on runways?
But to look for the flight from LAX …
There is not much new content this week. Just summary of what was written before.
Summary.
In the summary of this part I'd want to recommend
following:
Try to find a query which causes inconsistent behavior
Use log files:
-
- General query log
- Log files in your application
- MySQL Proxy or any other suitable proxy
- Others
Analyze what is wrong, then solve the problem using your
findings
Below I place a list of methods which we studied in the first
part.
Method #1: use output operator to output query in exactly same way in which RDBMS gets it. …
[Read more]
MySQL transforms IN predicates into nearly-equivalent EXISTS
predicates, and uses a set of mechanisms to catch all special
cases. This transformation, outlined below, also applies to NOT
IN predicates since a NOT IN b parses to NOT a IN b. Expressions
such as
SELECT * FROM table1 WHERE column1 IN (SELECT column2 FROM
table2)
get transformed into
SELECT * FROM table1 WHERE EXISTS ( SELECT column2 FROM table2
WHERE column1 = column2 OR column2 IS NULL )
since the latter can be executed more efficiently. For instance
an index can be used for accessing only the rows of table2 that
will actually participate in the final result. The added WHERE
condition is called a guarded condition, or trigger condition,
both terms are used. In order to get correct result, the
condition column1 = column2 must be deactivated if we are
comparing against a NULL value for column1. This guarantees
correct results for most cases but there are some …
MySQL performance depends on a balanced usage of MySQL indexes. While it is easy to add an index and identify queries not using indexes via EXPLAIN during development or slow.log it is a lot harder to get rid of unused indexes. Finding and removing them might be crucial for your performance as indexes can create a remarkable cpu cycle and i/o overhead during updates to tables (INSERT/UPDATE/DELETE).
The default MySQL community edition server from mysql.com or your Linux/BSD distribution (which you shouldn't use for a lot of reasons anyway) is not yet helpfull in this regard. There are however inofficial patches for advanced statistics that provide the details needed for optimizing your list of indexes. The easiest way to get started with a patched MySQL server is using a pre-patched binary. At Moviepilot an OurDelta's pre-patchted MySQL 5.0 server that includes the …
[Read more]In this post two weeks ago, I explained why the semantics of normal ad-hoc insertions with a primary key are expensive because they require disk seeks on large data sets. Towards the end of the post, I claimed that it would be better to use “replace into” or “insert ignore” over normal inserts, because the semantics of these statements do NOT require disk seeks. In this post, I explain how the command “replace into” can be fast with fractal trees.
The semantics of “replace into” are as follows:
-
- if the primary (or unique) key does not exist, insert the new
row
- if the primary (or unique) key does exist, overwrite the
existing row with the new row
The slow, expensive way B-trees use to implement these semantics
are:
There is not much new content this week. Just summary of what was written before.
Summary.
In the summary of this part I'd want to recommend
following:
Try to find a query which causes inconsistent behavior
Use log files:
-
- General query log
- Log files in your application
- MySQL Proxy or any other suitable proxy
- Others
Analyze what is wrong, then solve the problem using your
findings
Below I place a list of methods which we studied in the first
part.
Method #1: use output operator to output query in exactly same way in which RDBMS gets it. …
[Read more]
Ken North, writing in Dr. Dobb's Journal, gives a nice overview of the long and storied history of
SQL. The piece helps one understand the wave of mergers among
the big database vendors, and make sense of current trends in
database and database-like software. And I'd like to offer my
opinion about where SQL and database management systems are
headed.
North looks into the claims that 'the database is dead' and finds
that — yet again — reports of its death were greatly
exaggerated:Forrester Research recently estimated the total
database market (licenses, support, consulting) would grow from
$27 billion in 2009 to $32 billion by 2012. SQL technology is
entrenched in many organizations and across millions of web
sites. Perhaps that explains why, during the past decade, IBM,
Oracle, Sun and SAP made billion-dollar investments in a …
Elephants on parade: Hadoop goes mainstream. And more.
Follow 451 CAOS Links live @caostheory on Twitter and
Identi.ca
“Tracking the open source news wires, so you don’t have
to.”
Elephants on parade
# Cloudera launched v3 of its Distribution for Hadoop and
released v1 of
Cloudera Enterprise.
# Karmasphere released new Professional and Analyst Editions of its Hadoop development and deployment studio.
# Talend announced that its Integration Suite now offers native support for Hadoop.
# Yahoo …
[Read more]Tracking down InnoDB lock information using the new Information Schema tables provided with the InnoDB plugin has never been easier.
Long story short, the other day I was trying to identify what transaction was holding the lock for a particular UPDATE. The UPDATE would not complete and kept timing out with “Lock wait timeout exceeded; try restarting transaction”.
Of course I checked the output of SHOW ENGINE INNODB STATUS. From that output, I could tell it was the replication thread holding the lock, *since* it was the only other transaction running. But I could not verify this with output, it just had to be the case.
Next I enabled the InnoDB lock monitor, and examined that output too. Here, it showed more detail on the locks being held, however, it only shows the first 10 locks held by the replication thread. In this case, the …
[Read more]In the world of relational database, we are all aware of the benefits that row oriented databases deliver to businesses. They have been designed to support the relational model, have driven the broad adoption of an approachable open language for managing data (SQL), and have been the foundation for high transaction applications including online airline reservation systems, massive online commerce solutions and massively multiplayer online games.
Over the course of many years these entrenched row oriented databases have been challenged by the likes of object/relational, NoSQL solutions and even Map/Reduce (though it is certainly not a database). Sometimes row-oriented databases are either overkill or not a suitable fit for the task at hand. NoSQL differs significantly from the row oriented relational database, especially when it comes to dealing with large volumes of unstructured data. As an example it may be useful for rapidly storing and …
[Read more]