Have you ever had a spreadsheet file or a large .csv file that you wanted to manipulate, but you want more power than a spreadsheet program could offer?
Before I started using MySQL, I would usually throw the .csv file into a desktop database program, like FileMaker. FileMaker would allow you to import the .csv file and it would automatically create the column headers for you. Recently, I was given a spreadsheet with 27,000 rows in it. I still use FileMaker for some databases, but I wanted the power of MySQL to manipulate the information contained in this file. So, I could have easily just typed out the database column names manually into a MySQL “create table” statement, guessed at the types and sizes of the columns and then imported the .csv file. Instead, I decided to write a Perl script to do the dirty work for me. Plus, this spreadsheet had 45 columns of varying lengths. Ouch.
Please keep in mind that this was a quick hack. I did some testing on some small data, and then crunched the big spreadsheet. Everything seemed to work out okay, but I did get a bunch of Perl errors “Use of uninitialized value…”. So, I just turned off “warnings” and proceeded. Feel free to show me how to correct these errors – I just didn’t want to take the time. I am not an expert at writing Perl code anyway.
The script is fairly simple. The .csv file has to have header
information in order to be able to create the column names, the
delimiter must be double-quotes and a comma – “,” – and so each
line must look something like this:
"Name","Address","City","State","Zip","Phone","Amount1","Amount2","Amount3"
There are times when you will have a .csv file that doesn’t have the full “,” delimiter – such as when a column is a number, the “s are left off and only a comma is used as the delimiter – and I could have incorporated that functionality into the script, but my .csv file had the full delimiter for both numbers and text.
I created a test .csv file, and this is the data that I used to
initially test the script:
"Name","Address","City","State","Zip","Phone","Amount1","Amount2","Amount3"
"Tony Davidson","100 Main
Street","Suwanee","Georgia","30024","6785552222","440.03","4522.00","82837"
"Jack Smith","5000 Washington
Ave.","Chattanooga","California","31069","(770)5552941","400","456.787","2828"
"Paul Davis","P.O. Box
AB123","Jackson","Georgia","31069","(770)555-6532","325.1","837.3","3848"
"Jacky Smith-Davidson","400 Metropolitan
Ave","Macon","Georgia","30024-2032","770-555-8987","503.920","392.4","292833"
"Davd Jackson","100 Oak
Drive","Savannah","Tennessee","40024","770) 555-
8941","5492.0","15674.01","43"
The script only determines three data types – varchar, integer and decimal. If your data doesn’t fit into these three types, or if you want to use different data types, then the program probably won’t work for you. The script calculates the longest length for each column, so if you will be changing and increasing the length of the data in the columns, you will want to edit the length of the columns in the “create table” statement that is created.
Once the script was run against the above data, two files were created – mysql_create_table.sql, which is the SQL that you can use to create the actual table – and mysql_data_values.sql, which is the “insert into” file that you can use to load the data into the table.
Here is the SQL statement that was created by the script
(mysql_create_table.sql), which can use used to create the table
for the data above:
CREATE TABLE `Addresses` (
`Name` varchar (20),
`Address` varchar (20),
`City` varchar (11),
`State` varchar (10),
`Zip` varchar (10),
`Phone` varchar (14),
`Amount1` decimal (7,3),
`Amount2` decimal (8,3),
`Amount3` int (6)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
And here is the “insert” data (mysql_data_values.sql) that is
created that you can use to load the database:
insert into Addresses (Name, Address, City, State, Zip, Phone,
Amount1, Amount2, Amount3)
values ('Tony Davidson', '100 Main Street', 'Suwanee', 'Georgia',
'30024', '6785552222', '440.03', '4522.00', '82837');
insert into Addresses (Name, Address, City, State, Zip, Phone,
Amount1, Amount2, Amount3)
values ('Jack Smith', '5000 Washington Ave.', 'Chattanooga',
'California', '31069', '(770)5552941', '400', '456.787',
'2828');
insert into Addresses (Name, Address, City, State, Zip, Phone,
Amount1, Amount2, Amount3)
values ('Paul Davis', 'P.O. Box AB123', 'Jackson', 'Georgia',
'31069', '(770)555-6532', '325.1', '837.3', '3848');
insert into Addresses (Name, Address, City, State, Zip, Phone,
Amount1, Amount2, Amount3)
values ('Jacky Smith-Davidson', '400 Metropolitan Ave', 'Macon',
'Georgia', '30024-2032', '770-555-8987', '503.920', '392.4',
'292833');
insert into Addresses (Name, Address, City, State, Zip, Phone,
Amount1, Amount2, Amount3)
values ('Davd Jackson', '100 Oak Drive', 'Savannah', 'Tennessee',
'40024', '770) 555- 8941', '5492.0', '15674.01',
'43');
Here is the MySQL command that you can use to import your .sql
file into the MySQL database:
mysql -u -p database_name <
mysql_data_values.sql
You will want to edit the script and edit/enter your relevant
information – such as .csv file name, output file names and
storage engine.
I tried this on my 27,000 line .csv file, and it appeared to work. If the data in the .csv file had been a bit cleaner, I could state that it positively did work, but there was so much data that it was difficult to check. The theory is that with this script, you can take a unwieldy .csv file, import it into MySQL, do your work, and then export it again as a .csv file. And, you can always write a script to automatically execute the creation of your table and the importing of the data.
So, feel free to give this a shot and let me know if it worked for you or not.
#!/usr/bin/perl
use File::Copy;
#use warnings;
# edit these next lines to match your preferences
$TABLE_NAME = "Addresses";
$DATABASE_ENGINE = "InnoDB";
$DEFAULT_CHARSET = "latin1";
# enter the file name that you want to convert to the SQL
statements and values
$filename = "test.csv";
# create your output files
open(TABLE, ">mysql_create_table2.sql") || die "Can't redirect
stdout";
open(VALUES, ">mysql_data_values2.sql") || die "Can't redirect
stdout";
# count how many lines are processed
$count = 0;
# set to be blank
$Columns_Values = "";
# open the file to be processed
open FILE, "$filename" or die $!;
# assign the first line which contains headers to the variable
$columns
my $columns = <FILE>;
# remove carriage return (for Windows, if you have a CR and a LF,
you will need to chop twice)
chop $columns;
# uncomment the next line for Windows files
#chop $columns;
# check to see if the field contains a ' - and if so, add a slash
\ in front
$columns =~ s/'/\\'/g;
# ...remove the first " and then "," will be our delimiter
$columns =~ s/\"//;
# ...remove the last " from the end of the line so that "," will
be our delimiter
chop $columns;
# remove spaces, add an underscore _
$columns =~ s/ /_/g;
#print "$columns\n";
# split first line into individual field names
@Field_Names = split("\",\"",$columns);
# total number of field names
$Field_Names_Count = $#Field_Names;
# add one to the $Field_Names_Count
$Field_Names_Count_Plus_One = $Field_Names_Count + 1;
# start the field count at zero
$field_count = 0;
# create the column names (values) for the "insert into" part of
the SQL statement
if ($count == 0)
{
$column_count = 0;
while ($column_count <=
$Field_Names_Count)
{
if ($column_count <
$Field_Names_Count)
{
$Columns_Values
= $Columns_Values . $Field_Names[$column_count] . ", ";
}
if ($column_count ==
$Field_Names_Count)
{
$Columns_Values
= $Columns_Values . $Field_Names[$column_count];
}
$column_count++;
}
# end if ($count == 0)
}
$count = 0;
# continue to parse the rest of the file which contains the
data
while (<FILE>)
{
# remove the carriage return
chomp $_;
# remove the first " and then...
$_ =~ s/\"//;
# ...remove the last " from the end of the line so that "," will
be our delimiter
chop $_;
# split the first line into what will be used as the column
names
@Field_Values = split("\",\"",$_);
while ($field_count <= $Field_Names_Count )
{
# check to see if the field contains a ' - and
if so, add a slash \ in front
$Field_Values[$field_count] =~ s/'/\\'/g;
# if a
field is blank, set it to zero, and then remove the zero
later
if
(length($Field_Values[$field_count]) < 1)
{
$Field_Values[$field_count]
= "0";
}
# check to
see if the field value contains any alphabet characters
if (
$Field_Values[$field_count] =~ m/[a-zA-Z]/)
{
$type[$field_count]
= "varchar";
#
find the longest length of the data in the column
if
($length[$field_count] <
'length($Field_Values[$field_count])')
{
$length[$field_count]
= length($Field_Values[$field_count]);
}
}
# once a field has been designated as a
varchar, we don't need to test it any further
# as we aren't going to change a varchar field
back to a number or decimal field
if ($type[$field_count] ne "varchar")
{
# check to
see if the field value does not contain any alphabet
characters
if (
$Field_Values[$field_count] =~ m/[^a-zA-Z]/)
{
#
if the field was already determined to be a decimal, then keep it
a decimal
#
if not, then set it to be a number
if
($type[$field_count] ne "decimal")
{
$type[$field_count]
= "int";
#
find the longest length of the data in the column
if
($length[$field_count] lt
'length($Field_Values[$field_count])')
{
$length[$field_count]
= length($Field_Values[$field_count]);
}
}
}
# if the
field contains numbers and a period
if (
$Field_Values[$field_count] =~ m/[0-9.]/)
{
@count_periods
= split("\\.",$Field_Values[$field_count]);
$number_of_periods
= $#count_periods;
#
if there are two periods in the field, then it is a varchar
if
($number_of_periods > 1)
{
$type[$field_count]
= "varchar";
#
check for the length of the field to make sure we have the
highest field length
if
($length[$field_count] <
'length($Field_Values[$field_count])')
{
$length[$field_count]
= length($Field_Values[$field_count]);
}
#
set these values to be zero - in case the previous field
contained a decimal number
$decimal_length1[$field_count]
= "";
$decimal_length2[$field_count]
= "";
}
#
if there is only one period in the field, then it is a decimal
with X number of decimal places
if
($number_of_periods == 1)
{
$type[$field_count]
= "decimal";
#
split the number to find out the length of each side of the
decimal
#
example 1234.56 = 4,2
@split_decimal_number
= split("\\.",$Field_Values[$field_count]);
#
find the length of each side of the decimal and keep the highest
value
#
this is for the number to left of the decimal
if
($decimal_length1[$field_count] lt
length($split_decimal_number[0]))
{
$decimal_length1[$field_count]
= length($split_decimal_number[0]);
}
#
find the length of each side of the decimal and keep the highest
value
#
this is for the number to right of the decimal
if
($decimal_length2[$field_count] lt
length($split_decimal_number[1]))
{
$decimal_length2[$field_count]
= length($split_decimal_number[1]);
}
#
end if ($number_of_periods == 1)
}
# end if (
$Field_Values[$field_count] =~ m/[0-9.]/)
}
# if the
field contains anything else besides a 0-9 or a period (.)
if (
$Field_Values[$field_count] =~ m/[^0-9.]/)
{
$type[$field_count]
= "varchar";
#
find the longest length of the data in the column
if
($length[$field_count] lt
'length($Field_Values[$field_count])')
{
$length[$field_count]
= length($Field_Values[$field_count]);
}
# end if (
$Field_Values[$field_count] =~
m/[^0-9.]/)
}
# end if ($type[$field_count] ne
"varchar")
}
else
{
#
check for the length of the field to make sure we have the
highest field length
if
($length[$field_count] <
length($Field_Values[$field_count]))
{
$length[$field_count]
= length($Field_Values[$field_count]);
}
# end else
}
# uncomment this line if you want to see the
data being processed - as well as another line below
# print "$Field_Values[$field_count]
$type[$field_count] $length[$field_count]
$decimal_length1[$field_count] $decimal_length2[$field_count] |
";
# if a
field is blank, we set it to zero earlier, now we remove the
zero
if
(length($Field_Values[$field_count]) < 1)
{
$Field_Values[$field_count]
= "";
}
# create the syntax needed
for the "insert into" statement
if ($field_count == 0)
{
print
VALUES "insert into $TABLE_NAME ($Columns_Values) \nvalues
('$Field_Values[$field_count]'";
}
if
($field_count > 0 && $field_count <
$Field_Names_Count_Plus_One)
{
print
VALUES ", '$Field_Values[$field_count]'";
}
$field_count++;
# end while ($field_count
< $Field_Names_Count_Plus_One )
}
# check for
last entry and then start over on next line
if
($field_count == $Field_Names_Count_Plus_One)
{
$field_count
= 0;
$count++;
#
close the print statement for the column values
print
VALUES ");\n";
}
# uncomment this line if you want to see the
data being processed
# print "\n";
# end while (<FILE>)
}
# print the create table statement
print TABLE "\n\nCREATE TABLE `$TABLE_NAME` (\n";
$count_columns = 0;
# loop through the columns and print the type and length for
each
while ($count_columns < $Field_Names_Count_Plus_One)
{
# make sure that we don't have a blank field
value
if (length($Field_Names[$count_columns]) >
0)
{
if ($type[$count_columns] =~
"decimal")
{
$decimal_field_length
= $decimal_length1[$count_columns] +
$decimal_length2[$count_columns];
print TABLE
" `$Field_Names[$count_columns]` $type[$count_columns]
($decimal_field_length,$decimal_length2[$count_columns])";
}
else
{
print TABLE
" `$Field_Names[$count_columns]` $type[$count_columns]
($length[$count_columns])";
}
if ($count_columns <
$Field_Names_Count)
{
print TABLE
",\n";
}
if ($count_columns ==
$Field_Names_Count_Plus_One)
{
print TABLE
"\n\n";
}
# end if (length($Field_Names[$count_columns])
> 0)
}
$count_columns++;
# end while ($count_columns <
$Field_Names_Count_Plus_One)
}
# print an output to show how many lines were processed
print "Processed $column_count columns and $count lines.\n";
print TABLE "\n) ENGINE=$DATABASE_ENGINE DEFAULT CHARSET=$DEFAULT_CHARSET\n";
print TABLE "\n\n";
close(FILE);
exit;
—————————————–
Tony Darnell is a Principal Sales Consultant for MySQL, a division of
Oracle, Inc.
MySQL is the world’s most popular open-source database
program. Tony may be reached at info [at] ScriptingMySQL.com and on LinkedIn. |