Depth Reporting

Showing posts with label Data quality.

Monday, October 15, 2007

Social Science Statistics Blog: Visualization for data cleaning

Data cleaning is a boring, annoying task that is difficult, if not impossible, to do perfectly. Andy Eggers at the Social Science Statistics Blog explains how he and a colleague used data visualization to help:

The Times of London published election guides throughout the 20th century including voting results and candidate bios for every constituency in every election to the House of Commons. We scanned and OCR'd seven volumes of this series and wrote scripts to extract information about each constituency race, including the name, vote total, and short bio of each candidate. The challenge then was to determine which appearances belonged to the same individual. For example, when "P G Agnew" runs in 1950 and "Peter Agnew" runs in 1955, are they the same person? We trained a clustering algorithm to do this matching based on name similarity, year of birth, party, and gender, and wrote some scripts to catch likely errors. When we thought we had done as well as we could, we decided to produce a little visualization to admire our perfectly cleaned data. To our surprise, the visualization revealed a number of hard-to-catch remaining errors.
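To make the matching problem concrete, here is a minimal sketch of a pairwise comparison using the four fields the excerpt mentions. It is not the authors' method: they trained a clustering algorithm, while this is a simple hand-written scoring rule. The record layout, the function names, the weights, and the 0.8 similarity threshold are all assumptions, and every field value other than the two names quoted above is invented for illustration.

```python
from difflib import SequenceMatcher


def split_name(name):
    """Split a name into (forenames, surname), lowercased. Assumes surname is last."""
    parts = name.lower().split()
    return parts[:-1], parts[-1]


def forenames_compatible(a, b):
    """Compare forenames position by position.

    A single-letter initial matches any name starting with that letter;
    two full names must be close under difflib's similarity ratio.
    """
    for x, y in zip(a, b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif SequenceMatcher(None, x, y).ratio() < 0.8:  # assumed threshold
            return False
    return True


def match_score(r1, r2):
    """Return 0 (cannot be the same person) up to 10 (all fields agree).

    Weights are illustrative assumptions: name 5, birth year 2, gender 2, party 1.
    """
    f1, s1 = split_name(r1["name"])
    f2, s2 = split_name(r2["name"])
    if s1 != s2 or not forenames_compatible(f1, f2):
        return 0
    score = 5

    y1, y2 = r1.get("birth_year"), r2.get("birth_year")
    if y1 is not None and y2 is not None:
        if y1 != y2:
            return 0  # conflicting birth years rule the match out
        score += 2

    g1, g2 = r1.get("gender"), r2.get("gender")
    if g1 and g2:
        if g1 != g2:
            return 0  # conflicting gender rules the match out
        score += 2

    # Party is only a soft signal, since candidates can change party.
    if r1.get("party") and r1.get("party") == r2.get("party"):
        score += 1
    return score


# Hypothetical records; only the two names come from the excerpt.
agnew_1950 = {"name": "P G Agnew", "birth_year": 1900, "party": "Conservative", "gender": "M"}
agnew_1955 = {"name": "Peter Agnew", "birth_year": 1900, "party": "Conservative", "gender": "M"}
other = {"name": "Paul Agnew", "birth_year": 1900, "party": "Conservative", "gender": "M"}

print(match_score(agnew_1950, agnew_1955))
print(match_score(agnew_1955, other))
```

Note that "P G Agnew" is compatible with both "Peter Agnew" and "Paul Agnew" under this rule. That kind of ambiguity is why a score alone will not catch everything, and why a visual check of the linked records can still turn up errors.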

Thursday, November 16, 2006

Database quality

Dr. Dobb's Portal writes about how few organizations check their databases, and why they should.