Gearman MySQL UDFs: URL Processing

Yesterday we announced the new release of the Gearman server, library, and UDFs, and now I’d like to provide a more useful example. This example assumes you’ve taken the steps to install the C server and MySQL UDFs as explained in that post, and now we’ll be using those pieces and the Perl API to show how to do URL processing. This could be used by a number of applications - anything where you have a URL, need to fetch it, possibly do some processing, and then store it somewhere (RSS feed cache, pulling links out of an HTML page, image conversion, …). In this example, MySQL is being used as the repository and “trigger point”, helping track the requests and results. Since we’ll be using Perl for our worker code, the first thing to do is install the Perl API from CPAN:

> perl -MCPAN -eshell
cpan shell -- CPAN exploration and modules installation (v1.9205)
cpan[1]> install Gearman::Worker
…

This will download and install the modules for you. If you prefer not to use CPAN, you can download the module here and install it manually. With this installed, you can now write Gearman workers (and clients too) in Perl! Let’s give a short example:

#!/usr/bin/perl
use Gearman::Worker;

my $worker = Gearman::Worker->new();
$worker->job_servers('127.0.0.1');
$worker->register_function(reverse => sub { reverse $_[0]->arg; });
$worker->work while 1;

This is equivalent to the C worker I demonstrated yesterday that reversed a string. You could run this Perl script instead of the “reverse_worker” application and the “reverse_client” should work just the same. Now, onto something more useful!
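As a quick sanity check, you can also drive this from Perl instead of the C “reverse_client” - the same install gives you Gearman::Client as well. Here is a minimal client sketch (not from the original post):

#!/usr/bin/perl
# Minimal client sketch for the "reverse" function (assumes gearmand and a
# reverse worker are running on 127.0.0.1).
use Gearman::Client;

my $client = Gearman::Client->new();
$client->job_servers('127.0.0.1');

# do_task() blocks until a worker answers and returns a scalar reference,
# or undef if the job could not be run.
my $result = $client->do_task('reverse', 'Hello Gearman');
print defined $result ? ${$result} : 'job failed', "\n";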

We start off by creating the following table and inserting some values to test with:

use test;
CREATE TABLE url
(
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  url VARCHAR(255) NOT NULL,
  content LONGBLOB
);

INSERT INTO url SET url='http://www.drizzle.org/';
INSERT INTO url SET url='http://www.mysql.com/';
INSERT INTO url SET url='http://www.concentric.com/';

We can double check this by running:

mysql> SELECT id,url,LENGTH(content) FROM url;
+----+----------------------------+-----------------+
| id | url                        | LENGTH(content) |
+----+----------------------------+-----------------+
|  1 | http://www.drizzle.org/    |            NULL |
|  2 | http://www.mysql.com/      |            NULL |
|  3 | http://www.concentric.com/ |            NULL |
+----+----------------------------+-----------------+
3 rows in set (0.00 sec)

Now, let’s create a Perl worker that will fill in the ‘content’ column. The code for this could be something like:

#!/usr/bin/perl
use Gearman::Worker;
use LWP::Simple;

my $worker = Gearman::Worker->new();
$worker->job_servers('127.0.0.1');
$worker->register_function('url_get', \&url_get);
$worker->work while 1;

sub url_get
{
  print $_[0]->handle . " " . $_[0]->arg . "\n";
  get $_[0]->arg;
}

This relies on the LWP module (“Library for WWW in Perl”), which you can also install through CPAN or manually. We use the “get()” function to simply fetch the URL and return the content as our result. We’ve also called this Gearman function “url_get”, so when calling it from a Gearman client, we’ll be sure to use that name. Now make sure you still have gearmand running (as was explained yesterday), and then start up this Perl script.
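One small caveat before moving on (my aside, not part of the original worker): LWP::Simple’s get() returns undef when a fetch fails, so you may want to guard against that rather than returning an undefined result. A sketch of a slightly more defensive version:

#!/usr/bin/perl
# Sketch of a more defensive url_get worker (a variation on the worker
# above): fall back to an empty string when the fetch fails.
use Gearman::Worker;
use LWP::Simple;

my $worker = Gearman::Worker->new();
$worker->job_servers('127.0.0.1');
$worker->register_function('url_get', \&url_get);
$worker->work while 1;

sub url_get
{
  my $job = shift;
  print $job->handle . " " . $job->arg . "\n";

  # get() returns undef on failure; return "" so the caller still gets a result.
  my $content = get($job->arg);
  return defined $content ? $content : "";
}

Either version will do for this example. Now, back to the MySQL command line.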

First, make sure you have the Gearman server list set up:

mysql> SELECT gman_servers_set("127.0.0.1");

You only need to run this once, at server startup or when the functions are created; the module remembers the server list between sessions. Next, we populate the content table:

mysql> UPDATE url SET content=gman_do("url_get", url);
Query OK, 3 rows affected (1.72 sec)
Rows matched: 3  Changed: 3  Warnings: 0

mysql> SELECT id,url,LENGTH(content) FROM url;
+----+----------------------------+-----------------+
| id | url                        | LENGTH(content) |
+----+----------------------------+-----------------+
|  1 | http://www.drizzle.org/    |            6522 |
|  2 | http://www.mysql.com/      |            3364 |
|  3 | http://www.concentric.com/ |           17284 |
+----+----------------------------+-----------------+
3 rows in set (0.00 sec)

If all went well, you should have seen the “url_get” Perl worker print out the three URLs as it fetched them. Also, we can see the content blobs now hold the bodies of our URLs. Let’s think for a second about what just happened: we used the Gearman UDFs to send each URL to gearmand as a “url_get” job; the job server then fed the three jobs to our Perl worker; and the content bodies were sent back to the original caller and stored into the ‘url’ table.
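Nothing about “url_get” is tied to MySQL, by the way; any Gearman client can call it. For example, here is a small sketch (my own, not from the original post) that fetches one of the same URLs through the worker without touching the database:

#!/usr/bin/perl
# Sketch: call the same "url_get" worker from a plain Perl client,
# bypassing MySQL entirely.
use Gearman::Client;

my $client = Gearman::Client->new();
$client->job_servers('127.0.0.1');

my $content = $client->do_task('url_get', 'http://www.drizzle.org/');
print defined $content ? length(${$content}) : 0, " bytes fetched\n";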

One problem we can see right away is that the update took almost two seconds. This is because the URL fetches were executed serially (MySQL runs functions one at a time per row, not concurrently), and we blocked for each. It would be a bit nicer if we could just queue the fetches, possibly allowing them to be run in parallel. We can easily do this using a Gearman background job. This allows you to start a job, but instead of waiting for the result, it returns immediately. The job happens when a worker gets around to it, but this means that we need a different way to get the result. What we can do is have the worker insert the content directly into the table. Let’s see how this new background Perl worker will look:

#!/usr/bin/perl
use Gearman::Worker;
use LWP::Simple;
use DBI;

my $worker = Gearman::Worker->new();
$worker->job_servers('127.0.0.1');
$worker->register_function('url_get_bg', \&url_get_bg);
$worker->work while 1;

sub url_get_bg
{
  print $_[0]->handle . " " . $_[0]->arg . "\n";
  my $content= get $_[0]->arg;

  my $dbh = DBI->connect("DBI:mysql:test:127.0.0.1", "root");
  my $sth = $dbh->prepare("UPDATE url SET content=? WHERE url=?");
  $sth->execute($content, $_[0]->arg);
  $sth->finish();
  $dbh->disconnect();
  "";
}

As you can see, we’re now using the Perl DBI to make a connection back to the MySQL server to update the content. We also just return an empty string from this function, which means no data is sent back to the job server (just a job-completed message). You can stop the old Perl worker, start this one up instead, and then trigger it from MySQL:

mysql> UPDATE url SET content=NULL;
Query OK, 3 rows affected (0.00 sec)
Rows matched: 3  Changed: 3  Warnings: 0

mysql> SELECT gman_do_background("url_get_bg", url) FROM url;
+---------------------------------------+
| gman_do_background("url_get_bg", url) |
+---------------------------------------+
| H:lap:7                               |
| H:lap:8                               |
| H:lap:9                               |
+---------------------------------------+
3 rows in set (0.00 sec)

We first cleared the content to easily see when we get new content from the worker, and then used the Gearman background UDF. This sends the URL and the “url_get_bg” function name to the job server (gearmand); a job handle is created and returned immediately. This is what we see in the SELECT output above: the three new job handles. As you can see, this ran quickly, but if you were watching the Perl worker in the other terminal, you would have noticed that the URLs were still fetched in order at about the same speed.

Wait a minute - we were only running one Perl worker, and we had three jobs queued. Let’s start up more workers so they can run concurrently! Start up two more copies of the Perl worker in new terminals or in the background, and try the above SELECT again. If you watch the Perl workers, you should see all three of them grab a job right away, fetch the URL, and insert it back into the ‘url’ table. The work completes in the time it takes the longest-running URL to be fetched. Want to run even more jobs concurrently? Just start up more workers, possibly on different machines. We now have a generic framework set up to run MySQL functions as (possibly distributed) queued background jobs!
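If you would rather not juggle terminals, a tiny launcher script can start several copies for you. Here is a sketch (“url_get_bg_worker.pl” is just a placeholder for wherever you saved the background worker above):

#!/usr/bin/perl
# Sketch: fork a few copies of the background worker so queued jobs can be
# processed in parallel. "url_get_bg_worker.pl" is a placeholder filename.
use strict;
use warnings;

my $count = 3;
for (1 .. $count) {
  my $pid = fork();
  die "fork failed: $!" unless defined $pid;
  if ($pid == 0) {
    # Child process: replace ourselves with a worker.
    exec($^X, 'url_get_bg_worker.pl') or die "exec failed: $!";
  }
}

# Parent: stay in the foreground until the workers are killed.
wait() for 1 .. $count;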

Rather than a simple URL fetch, we may have some resource-intensive job that we don’t want running on a single machine (like document or image conversion). Rather than pushing a URL around, we could push a shared file system path or a memcached key. We also have the vast library of Perl modules at our disposal to do pretty much anything we’d like in the worker code (and the worker can always be re-written in C if performance becomes an issue). Of course, you can use one of the other language APIs if your application is better suited to another language.

I’m very curious to see what others will come up with since I’ve just scratched the surface of how this framework can be used. If you have come up with a clever use case, be sure to post it on the Gearman Wiki so others can learn!