I had almost forgotten that I wrote this code a while back. Someone
asked me about it yesterday, so I dusted off the parallel CSV reader
this morning, and here are the results:
This test reads a generated file with 10M customer records,
919169988 bytes in size, in 18.3 seconds (about 50MB/s).
Obviously, my poor laptop disk can’t deliver at that speed, so
these test results were obtained with the help of the excellent
Linux caching system.
In any case, the caching system simulates a faster disk subsystem.
On my computer, the system doesn’t really scale linearly
(especially since, in this case, the OS uses up some CPU power
too), but the speedup is noticeable: from 25.8 to 18.3 seconds
(about 30% less elapsed time).
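For anyone who wants to check the arithmetic, here is a small sketch that recomputes the throughput and the speedup from the figures above. It assumes that "MB" means decimal megabytes (10^6 bytes) and that the 25.8 second figure is the run without the extra parallelism; neither is stated explicitly, and the function and variable names are made up for this illustration.

```python
# Figures quoted in the post.
FILE_SIZE_BYTES = 919_169_988
PARALLEL_SECONDS = 18.3
BASELINE_SECONDS = 25.8  # assumption: the slower, less parallel run


def throughput_mb_per_s(size_bytes, seconds):
    """Throughput in decimal megabytes (10**6 bytes) per second."""
    return size_bytes / seconds / 1_000_000


parallel_rate = throughput_mb_per_s(FILE_SIZE_BYTES, PARALLEL_SECONDS)
baseline_rate = throughput_mb_per_s(FILE_SIZE_BYTES, BASELINE_SECONDS)

# Fraction of elapsed time saved, and the equivalent speedup factor.
time_saved = 1 - PARALLEL_SECONDS / BASELINE_SECONDS
speedup = BASELINE_SECONDS / PARALLEL_SECONDS

print(f"parallel run: {parallel_rate:.1f} MB/s")
print(f"baseline run: {baseline_rate:.1f} MB/s")
print(f"elapsed time saved: {time_saved:.1%}")
print(f"speedup factor: {speedup:.2f}x")
```

Note that the same result can be quoted two ways: roughly 29% less elapsed time, or roughly 1.4 times the throughput.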
The interesting thing is that if you have more CPUs at your
disposal (both SMP and clustered setups work) you can probably
make it scale to the full extent of your disk speed.
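The reader's actual code is not shown here, so the following is only a rough sketch of one common way to split a text file among several workers, not the real implementation. It assumes each worker gets a byte range and handles exactly the lines that start inside that range, and that no field contains an embedded newline. All names (split_ranges, count_rows, parallel_row_count) are hypothetical, and the sketch only counts rows instead of parsing them.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, workers):
    """Cut [0, size) into at most `workers` contiguous byte ranges."""
    chunk = max(1, -(-size // workers))  # ceiling division, never zero
    return [(start, min(start + chunk, size)) for start in range(0, size, chunk)]


def count_rows(path, start, end):
    """Count the lines whose first byte lies in [start, end)."""
    rows = 0
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read to the next newline. This skips
            # the tail of a line that began in the previous range, and is
            # a no-op if a new line happens to begin exactly at `start`.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            rows += 1  # a real reader would parse the fields here
    return rows


def parallel_row_count(path, workers=4):
    """Process every byte range concurrently and add up the results."""
    ranges = split_ranges(os.path.getsize(path), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: count_rows(path, *r), ranges))
```

Threads keep the sketch short, but in CPython they will not speed up CPU-bound parsing; separate processes (or separate machines, each given its own byte range) would be needed for that.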
In the case where lazy conversion is disabled …