Showing entries 1 to 7
Displaying posts with tag: data mining
Getting Data into Hadoop in real-time

Moving data between databases is hard. Without ever intending it, I seem to have spent a lifetime working on solutions for getting data into and out of databases, and more frequently between them. In fact, my first job out of university was migrating data from BRS/Text, a free-text database (what we would probably call a NoSQL database today), into a more structured Oracle database.

Today I spend some of my time working in Big Data, more often than not migrating information from existing data stores into Big Data platforms so that it can be analysed, something I covered in more detail here:

http://www.ibm.com/developerworks/library/bd-sqltohadoop1/index.html
http://www.ibm.com/developerworks/library/bd-sqltohadoop2/index.html

[Read more]
Four short links: 21 October 2010
  1. Using MySQL as NoSQL -- 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients.
  2. Making an SLR Camera from Scratch -- amazing piece of hardware devotion. (via hackaday.com)
  3. Mac App Store Guidelines -- Apple announces an app store for the Macintosh, similar to its app store for iPhones and iPads. "Mac App" no longer means a generic "program"; it has a new and specific meaning: a program that must be installed through the App Store and which has limited functionality …
[Read more]
Four short links: 10 December 2009
  1. Scriblio -- open source CMS and catalogue built on WordPress, with faceted search and browse. (via titine on Delicious)
  2. Useful Temporal Functions and Queries -- SQL tricksies for those working with timeseries data. (via mbiddulph on Delicious)
  3. Optimal Starting Prices for Negotiations and Auctions -- Mind Hacks discussion of a research paper on whether high or low initial prices lead to higher price outcomes in negotiations and online auctions. Many negotiation books recommend waiting for the other side to …
[Read more]
Four short links: 1 December 2009
  1. Apertus -- open source cinema camera. (via joshua on Delicious)
  2. A Survey of Collaborative Filtering Techniques -- From basic techniques to the state-of-the-art, we attempt to present a comprehensive survey for CF techniques, which can be served as a roadmap for research and practice in this area. (via bos on Delicious)
  3. Drizzle Replication using RabbitMQ as Transport -- we're watching the growing use of message queues in web software, and here's an interesting application. (via …
[Read more]
Four short links: 26 October 2009
  1. Toiling in the Data Mines -- Tom Armitage describes the process that BERG calls "material exploration". Programmers very rarely talk about what their work feels like to do, and that's a shame. Material explorations are something I've really only done since I've joined BERG, and both times have felt very similar - in that they were very, very different to writing production code for an understood product. They demand code to be used as a sculpting tool, rather than as an engineering material, and I wanted to explain the knock-on effects of that: not just in terms of what I do, and the kind of code that's appropriate for that, but also in terms of how I feel as I work on these explorations. Even if the section on the code itself feels foreign, I hope that the explanation of what it …
[Read more]
Business Intelligence for the People

Business intelligence has been talked about for quite a while. Even today, while companies are looking to make budget cuts, some experts are saying that BI can be used to beat the recession.

When I hear about BI systems, the first thing that comes to my mind is a huge and expensive system with very powerful servers that sucks in data from many sources and runs an intensive and even more expensive reporting suite. Since I have been involved in projects to set those systems up, I know that one can easily take around a year to complete.

So everyone is in fact thinking about saving money yet still being …

[Read more]
SQL Puzzle

Dear lazyweb,

I want to mine a code repository for data to map past bugs to source code files.

I have written a small PHP script (the initial version of the script can be found here) to import the relevant data from a Subversion repository into the following tables of a relational database:

bugs            changes         paths
--------        --------        -------
bug_id          path_id   <-->  path_id
revision  <-->  revision        path

What I need now is two queries to ask the database for

  • paths that are most commonly changed during bugfix commits and
  • paths that are commonly changed together …
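One possible shape for the two queries, sketched in Python with the standard-library sqlite3 module so that it can be run as-is. The table and column names follow the diagram above. Everything else is an assumption made for illustration and is not taken from the original script: the choice of SQLite, the composite primary key on changes, the sample rows, and the function names.

```python
import sqlite3

# Schema as drawn in the post. The composite primary key on `changes`
# is an assumption: it guarantees one row per (path, revision).
SCHEMA = """
CREATE TABLE paths   (path_id INTEGER PRIMARY KEY, path TEXT NOT NULL);
CREATE TABLE changes (path_id INTEGER NOT NULL, revision INTEGER NOT NULL,
                      PRIMARY KEY (path_id, revision));
CREATE TABLE bugs    (bug_id INTEGER NOT NULL, revision INTEGER NOT NULL);
"""

# Paths ranked by the number of distinct bugfix revisions that touched them.
HOTSPOTS_SQL = """
SELECT p.path, COUNT(DISTINCT c.revision) AS fixes
FROM bugs b
JOIN changes c ON c.revision = b.revision
JOIN paths   p ON p.path_id  = c.path_id
GROUP BY p.path_id, p.path
ORDER BY fixes DESC, p.path
"""

# Pairs of paths ranked by the number of revisions that changed both.
# The self-join uses path_id < path_id so each pair is counted once.
CO_CHANGED_SQL = """
SELECT p1.path, p2.path, COUNT(*) AS together
FROM changes c1
JOIN changes c2 ON c2.revision = c1.revision AND c1.path_id < c2.path_id
JOIN paths   p1 ON p1.path_id = c1.path_id
JOIN paths   p2 ON p2.path_id = c2.path_id
GROUP BY c1.path_id, c2.path_id, p1.path, p2.path
ORDER BY together DESC, p1.path, p2.path
"""


def build_demo_db():
    """Create an in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO paths VALUES (?, ?)",
        [(1, "src/a.php"), (2, "src/b.php"), (3, "src/c.php")],
    )
    conn.executemany(
        "INSERT INTO changes VALUES (?, ?)",
        [(1, 10), (2, 10), (1, 11), (3, 11), (1, 12), (2, 12), (3, 13)],
    )
    conn.executemany(
        "INSERT INTO bugs VALUES (?, ?)",
        [(100, 10), (101, 12), (102, 13), (103, 11)],
    )
    return conn


def bugfix_hotspots(conn):
    """Return (path, number_of_bugfix_revisions) rows, most changed first."""
    return conn.execute(HOTSPOTS_SQL).fetchall()


def co_changed_paths(conn):
    """Return (path_a, path_b, shared_revisions) rows, most frequent first."""
    return conn.execute(CO_CHANGED_SQL).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for row in bugfix_hotspots(db):
        print(row)
    for row in co_changed_paths(db):
        print(row)
```

The second query counts co-changes across all commits; joining it to bugs on revision would restrict it to bugfix commits only, if that is what is wanted.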
[Read more]