Posts tagged ‘data quality’

March 2, 2012

Google Maps: Foreign affairs and social skirmishes

Google Earth and Google Maps are probably the most popular, free, online cartography reference tools for the public. Popularity is not the same as authority though[1]:

The lines that Google draws on maps have no government’s imprimatur.

Foreign affairs

Google should not be involved in geopolitical disputes. If Google Maps shows borders or place names that differ from official or long-established usage, it can confuse, offend, or worse, even unintentionally[2]:

On Nov. 3, 2010, a Nicaraguan official justified his country’s incursion into neighboring Costa Rica’s territory by claiming that, contrary to the customary borderline, he wasn’t trespassing. For proof, he [cited] Google Maps.

Google Map art

Map markers away! Fighting the cartographic unknown

Google DOES try to offer meaningful, accurate maps.


January 21, 2011

Search engine spam

… a decade ago, the spam situation was so bad that search engines would regularly return off-topic webspam. For the most part, Google has successfully beaten that—even while some spammers resort to sneakier or even illegal tactics such as hacking websites. Today, English-language spam in Google’s results is less than half what it was five years ago, and even lower in other languages.

However, we have seen a slight uptick of spam in recent months.

We recently launched a document-level classifier that makes it harder for spammy on-page content to rank highly. The new classifier is better at detecting spam on individual web pages, e.g., repeated spammy words—the sort of phrases you tend to see in junky, automated, self-promoting blog comments. We’ve also radically improved our ability to detect hacked sites.

We’ll explore … new ways for users to give more explicit feedback about spammy and low-quality sites.

As “pure webspam” has decreased over time, attention has shifted instead to “content farms,” which are sites with shallow or low-quality content. In 2010, we launched two major algorithmic changes focused on low-quality sites. Nonetheless, people are asking for even stronger action [on such sites]. We can and should do better.

via Official Google Blog: Google search and search engine spam.
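The quoted “document-level classifier” flags pages where spammy phrases are repeated. Google’s actual classifier is unpublished, so the following is only an illustrative sketch of the idea; the phrase list and scoring rule are invented for the example:

```python
from collections import Counter

# Hypothetical phrase list, purely for illustration.
SPAMMY_PHRASES = {"buy now", "cheap pills", "click here", "free money"}

def spam_score(text, ngram=2):
    """Toy document-level score: the fraction of word bigrams on a
    page that match known spammy phrases. Repetition (as in junky,
    automated blog comments) drives the score up."""
    words = text.lower().split()
    bigrams = [" ".join(words[i:i + ngram])
               for i in range(len(words) - ngram + 1)]
    if not bigrams:
        return 0.0
    hits = Counter(b for b in bigrams if b in SPAMMY_PHRASES)
    return sum(hits.values()) / len(bigrams)

print(spam_score("click here click here for free money click here"))  # 0.5
print(spam_score("an ordinary sentence about maps"))                   # 0.0
```

A real classifier would of course learn its features from labeled data rather than rely on a hand-written phrase list, but the repetition signal works the same way.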

November 15, 2010

Open Source Data Quality Tool

I was surprised to see Google enter an important area that it had not approached before: Data quality.

Google Refine 2.0 was released last week.

Google Refine is an open source data quality and data integration tool. DataQualityPro seemed impressed with Refine. Refine is Google’s first “consumer” product* for data quality.

Google Refine 2.0

Google Refine is a data quality app that runs in your browser

Refine is presented as a tool for especially messy data sets, with inconsistent content, mismatched formatting or units, and in dire need of clean-up for improved referential integrity.
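To make that concrete, here is a rough sketch of one clean-up technique Refine offers: clustering near-duplicate values by a normalized “fingerprint” key (lowercase, punctuation stripped, tokens sorted and deduplicated). This is a simplified re-implementation of the idea, not Refine’s code, and the sample data is invented:

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Reduce a messy string to a normalized key: lowercase,
    punctuation stripped, unique tokens sorted. Values that differ
    only in case, spacing, or punctuation share a fingerprint."""
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))

def cluster(values):
    """Group raw values sharing a fingerprint, so a human (or a rule)
    can pick one canonical spelling per group."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return {k: g for k, g in groups.items() if len(g) > 1}

messy = ["Acme Corp.", "acme corp", "ACME  Corp", "Widgets Ltd"]
print(cluster(messy))
# {'acme corp': ['Acme Corp.', 'acme corp', 'ACME  Corp']}
```

In Refine the analogous feature presents such clusters interactively and lets you merge each group to a single value in one click.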

Remember though: This is a free web app!  It isn’t SAS Data Miner. The comments in the DataQualityPro post make that clear. Have a look at them if you want to get an idea of what Refine’s benchmark performance might be. Some of the comments are funny. I suspect that later versions of Google Refine will focus on performance.

Synergies from a Google-built data quality tool

An obvious benefit will be ease of access to certain static databases such as latitude and longitude. Also, there should be fewer discrepancies due to inconsistently defined data formats when working with Google-maintained data sets. Compatibility with Google’s other open-source applications is interesting to contemplate, though not certain.

Google posted three pleasingly brief (under 15 minutes each) “how-to” videos for Refine users:

  • Introduction
  • Data Transformation
  • Data Augmentation

This is the first of the series; the other two are also available on YouTube.

If this is version 2.0, what was version 1.0?

I do not know if there was a Google Refine 1.0. Nor could I find any reference to Google deprecating an earlier version of Refine, which was somewhat odd. Perhaps version 1.0 was internal-use only.

Please leave a comment if you have any ideas!

UPDATE: June 2011

The predecessor to Google Refine 2.0 (call it Google Refine 1.0 if you will) was Gridworks! Gridworks is a data quality tool that I had associated exclusively with Freebase.

Here’s some background: Freebase is a large open-use database which is designed for semantic as well as algorithmic or machine search. Gridworks was developed by Metaweb for use with Freebase. Google acquired Metaweb Technologies in late June 2010. I found the connection between Refine 1.0 and Gridworks only a few moments ago, while browsing through a Gridworks write-up on The Chicago Tribune data blog. It was dated 17 May 2010, before Google announced any intent to purchase Metaweb.

*There are other Google data projects such as Bigtable. But Bigtable is for “Big Data” or applications development, unlike Refine.

October 27, 2010

Making the Case for Better Data Quality in Healthcare

Will better data quality improve patient quality of care? The answer should certainly be a strong affirmative.

Tony Fisher fills in the details of how better healthcare data quality will have a direct, positive impact on patient care. He gives solid reasons to support this.

Funding for data management and data integrity is often stymied because it is difficult to quantify the benefits. Data initiatives are perceived as useful general measures, to be borne as marginal costs. In fact, better data quality yields quality-of-service improvements and contributes to ROI.

This is a reminder to assess the Google Health product more completely; it is available but not widely promoted, particularly not to consumers. Note that Google Health is not subject to HIPAA regulations.