7.19.2010

Efficient Rule Based Scheduling

Recently at work we had a project to convert a number of MyISAM latin1 databases to utf8. The process we settled on was to take LVM snapshots of the databases before we started the conversions, so that if something went wrong we could mount the snapshots and restore quickly. The problem that arose from this was limited space for the snapshots: we needed to break the conversions up so that we were never converting more data than the size of the snapshots we were creating. I needed to come up with a schedule for the conversions that maximized the number of databases converted at a time while never allowing more data to change than the size of the snapshot.

The answer was simple enough: collect the sizes of all the databases, sort them largest to smallest, and then build each round by taking the largest remaining database off the top of the list and adding the smallest databases from the bottom of the list until we hit our size limit. For example, with a 4 GB limit and databases of 3.5 GB, 2.5 GB, 2 GB, 0.4 GB, and 0.3 GB, round one would be the 3.5 GB and 0.3 GB databases, round two the 2.5 GB and 0.4 GB databases, and round three the 2 GB database on its own.

As per my usual, I solved this with Perl. Here is what it looked like:

First, let's get a list of the databases:

my $dbh = DBI->connect( $dsn, $user, $password, { RaiseError => 1, AutoCommit => 0 } );
my $dbs = $dbh->selectcol_arrayref(q{SHOW DATABASES});
$dbh->disconnect;


Now let's get the size on disk of each:


my @dbsizes;
foreach my $db ( @$dbs ) {
    next if grep { $_ eq $db } @skip;
    opendir my $dh, "$dbpath/$db" or die $!;
    # Add up the file sizes for all files in the directory;
    # there are no subdirectories, or this would need to be recursive
    my $t;
    foreach my $file ( readdir($dh) ) {
        next if $file eq '.' || $file eq '..';
        my @s = stat "$dbpath/$db/$file" or die $!;
        $t += $s[7];    # $s[7] is the size in bytes
    }
    closedir $dh;
    push(@dbsizes, [$t, $db]) if defined($t);
}


And now let's sort the list from largest to smallest:


my @sites = sort { $b->[0] <=> $a->[0] } @dbsizes;


Let's get to the actual scheduling. Here I'm using a recursive function that builds a hash reference keyed by round number, with each value an array of the databases to convert in that round.


# Recursive function; builds a schedule for database conversion so that no
# round of conversions exceeds $max_size_per_round or $max_sites_per_round
sub build_schedule {
    my ($sites, $top, $round, $total, $schedule) = @_;
    $top   = 1 unless defined($top);
    $round = 1 unless defined($round);
    $total = 0 unless defined($total);
    # Start a round with the largest remaining database, then fill with the smallest
    my $next_site = $top ? shift @$sites : pop @$sites;
    $top = 0;
    return $schedule unless $next_site;
    my ($size, $site) = @$next_site;
    if ( ($total + $size) > $max_size_per_round
        || @{ $schedule->{$round} || [] } >= $max_sites_per_round ) {
        if ($schedule->{$round}[0]) {
            # The round is full; put the database back for a later round
            push @$sites, $next_site;
        }
        else {
            # A database larger than the limit gets a round to itself
            push(@{$schedule->{$round}}, $site);
        }
        $total = 0;
        $top   = 1;
        $round++;
    }
    else {
        push(@{$schedule->{$round}}, $site);
        $total += $size;
    }
    build_schedule($sites, $top, $round, $total, $schedule);
}



Finally, we'll print the schedule:

my $schedule = build_schedule(\@sites);
# Print schedule
foreach my $round ( sort { $a <=> $b } keys %$schedule ) {
    print "Round $round\n\t";
    print join("\n\t", @{$schedule->{$round}} );
    print "\n";
}
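
With hypothetical database names (the real list depends on what's on your server), the output ends up looking something like this:

Round 1
	customers_db
	tiny_blog
Round 2
	reporting_db
	forum_db
Round 3
	archive_db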



Here is the complete script:

#!/usr/bin/perl -w
use DBI;
use strict;
my $dsn = q{dbi:mysql:mysql:localhost:3306};
my $user = q{root};
my $password = q{password};
my $max_size_per_round = 4_000_000_000; # Max size for each round
my $max_sites_per_round = 5; # Max number of sites for each round
my $dbpath = q{/var/lib/mysql};
my @skip = qw{ information_schema mysql };

# Get a list of all databases on the server
my $dbh = DBI->connect( $dsn, $user, $password, { RaiseError => 1, AutoCommit => 0 } );
my $dbs = $dbh->selectcol_arrayref(q{SHOW DATABASES});
$dbh->disconnect;

# Get the size on disk for each database
my @dbsizes;
foreach my $db ( @$dbs ) {
    next if grep { $_ eq $db } @skip;
    opendir my $dh, "$dbpath/$db" or die $!;
    # Add up the file sizes for all files in the directory;
    # there are no subdirectories, or this would need to be recursive
    my $t;
    foreach my $file ( readdir($dh) ) {
        next if $file eq '.' || $file eq '..';
        my @s = stat "$dbpath/$db/$file" or die $!;
        $t += $s[7];    # $s[7] is the size in bytes
    }
    closedir $dh;
    push(@dbsizes, [$t, $db]) if defined($t);
}

# Sort dbs in order of largest to smallest
my @sites = sort { $b->[0] <=> $a->[0] } @dbsizes;
my $schedule = build_schedule(\@sites);

# Print schedule
foreach my $round ( sort { $a <=> $b } keys %$schedule ) {
    print "Round $round\n\t";
    print join("\n\t", @{$schedule->{$round}} );
    print "\n";
}


# Recursive function; builds a schedule for database conversion so that no
# round of conversions exceeds $max_size_per_round or $max_sites_per_round
sub build_schedule {
    my ($sites, $top, $round, $total, $schedule) = @_;
    $top   = 1 unless defined($top);
    $round = 1 unless defined($round);
    $total = 0 unless defined($total);
    # Start a round with the largest remaining database, then fill with the smallest
    my $next_site = $top ? shift @$sites : pop @$sites;
    $top = 0;
    return $schedule unless $next_site;
    my ($size, $site) = @$next_site;
    if ( ($total + $size) > $max_size_per_round
        || @{ $schedule->{$round} || [] } >= $max_sites_per_round ) {
        if ($schedule->{$round}[0]) {
            # The round is full; put the database back for a later round
            push @$sites, $next_site;
        }
        else {
            # A database larger than the limit gets a round to itself
            push(@{$schedule->{$round}}, $site);
        }
        $total = 0;
        $top   = 1;
        $round++;
    }
    else {
        push(@{$schedule->{$round}}, $site);
        $total += $size;
    }
    build_schedule($sites, $top, $round, $total, $schedule);
}

7.15.2010

Process an SQL Dump One Statement at a Time (Grep for Text Over Multiple Lines)

Recently I was asked how to loop through a MySQL dump file one statement at a time, given that the statements in a dump file are broken up over multiple lines. There are any number of ways of doing this, but since I love Perl, that's usually my first stop.

The answer in Perl is simple: change the input record separator ( $/ ) and slurp the file into an array. For a MySQL dump, the actual record separator is a semicolon followed by a newline ( ";\n" ).


open( my $fh, "<", "Dump.sql" ) or die $!;

my @file;
{
    # Read ";\n"-terminated statements instead of lines
    local $/ = ";\n";
    @file = <$fh>;
}
close $fh;


Now unless ';\n' somehow ended up in the middle of an SQL statement, you have an array of SQL statements, one statement per element.
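
If you just want to act on each statement in turn, a plain loop over the array does it. Here is a minimal sketch, assuming the Dump.sql file from above and that you only care about CREATE TABLE statements (the pattern is just an example):

foreach my $statement (@file) {
    # Each element is one full statement, possibly spanning many lines
    next unless $statement =~ /^CREATE TABLE/m;
    print $statement, "\n";
}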

There is one problem with this method: it uses a lot of memory, at least as much as the size of your file. If you're dealing with large files, you'll want to process the file one record at a time instead. Tie::File comes in handy here.

use Tie::File;

# Tie::File ignores $/, so the record separator is passed as the recsep option
my @file;
tie @file, 'Tie::File', "Dump.sql", recsep => ";\n" or die;


You can now use the @file array to loop through the file one SQL statement at a time, without holding the whole file in memory. Say you wanted to find a pattern and collect the statements that match.

my @matches = grep { /$pattern/ } @file;

print "Found ", scalar(@matches), " matches\n";