November 15, 2010

Open Source Data Quality Tool

I was surprised to see Google enter an important area that it had not approached before: data quality.

Google Refine 2.0 was released last week

Google Refine is an open source data quality and data integration tool. DataQualityPro seemed impressed with Refine. Refine is Google’s first “consumer” product* for data quality.

Google Refine 2.0

Google Refine is a data quality app that runs in your browser

Refine is presented as a tool for especially messy data sets, with inconsistent content, mismatched formatting or units, and in dire need of clean-up for improved referential integrity.
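To give a feel for what cleaning up "inconsistent content" can mean in practice, here is a minimal Python sketch that groups values differing only in case, punctuation, spacing, or word order. It is an illustration of the general idea only, not Refine's actual code; the function names `fingerprint` and `cluster` and the sample values are my own assumptions.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key (hypothetical helper)."""
    cleaned = value.strip().lower()
    # Drop ASCII punctuation so "boston." and "Boston" produce the same key.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Sort unique tokens so word order does not matter.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


# Made-up sample data, for illustration only.
messy = ["New York", "new york ", "York, New", "Boston", "boston.", "Chicago"]
clusters = cluster(messy)
for group in clusters:
    print(group)
```

Each printed group is a set of spellings that probably refer to the same thing and could be merged into one canonical value. Mismatched units would need a separate conversion step that this sketch does not attempt.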

Remember though: This is a free web app!  It isn’t SAS Enterprise Miner. The comments in the DataQualityPro post make that clear. Have a look at them if you want to get an idea of what Refine’s benchmark performance might be. Some of the comments are funny. I suspect that later versions of Google Refine will focus on performance.

Synergies from a Google-built data quality tool

An obvious benefit will be ease of access to certain static databases such as latitude and longitude. Also, there should be fewer discrepancies due to inconsistently defined data formats when working with Google-maintained data sets. Compatibility with Google’s other open-source applications is interesting to contemplate, though not certain.

Google posted three pleasingly brief (under 15 minutes each) “how-to” videos for Refine users:

  • Introduction
  • Data Transformation
  • Data Augmentation

This is the first of the series:

The other two are also available on YouTube.

If this is version 2.0, what was version 1.0?

I do not know if there was a Google Refine 1.0. Nor could I find any reference to Google deprecating an earlier version of Refine, which was somewhat odd. Perhaps version 1.0 was internal-use only.

Please leave a comment if you have any ideas!

UPDATE: June 2011

The predecessor to Google Refine 2.0, call it Google Refine 1.0 if you will, was Gridworks! Gridworks is a data quality tool that I associated exclusively with Freebase.

Here’s some background: Freebase is a large open-use database that is designed for semantic as well as algorithmic or machine search. Gridworks was developed by Metaweb for use with Freebase. Google acquired Metaweb Technologies in July 2010. I found the connection between Refine 1.0 and Gridworks only a few moments ago, while browsing through a Gridworks write-up on The Chicago Tribune data blog. It was dated 17 May 2010, before Google announced any intent to purchase Metaweb.

*There are other Google data projects, such as Bigtable. But Bigtable is for “Big Data” or applications development, unlike Refine.