Depth Reporting

Showing posts with label Data. Show all posts
Thursday, July 10, 2008

Free book on data wrestling

Paul Murrell, a senior lecturer in statistics at New Zealand's University of Auckland, has published a working draft of his upcoming book, opaquely titled "Introduction to Data Technologies," online. There's also a PDF you can download. The book, written for academics but potentially useful to geeky, data-oriented journalists, discusses how to work with HTML, CSS, XML, databases, SQL and R. From the introduction:

The basic premise of this book is that scientists are required to perform many tasks with data other than statistical analyses. A lot of time and effort is usually invested in getting data ready for analysis: collecting the data, storing the data, transforming and subsetting the data, and transferring the data between different operating systems and applications.

Many scientists acquire data management skills in an ad hoc manner, as problems arise in practice. In most cases, skills are self-taught or passed down, guild-like, from master to apprentice. This book aims to provide a more structured and more complete introduction to the skills required for managing data.

The focus of this book is on computational tools that make the management of data faster, more accurate, and more efficient. The intention is to improve the awareness of what sorts of tasks can be achieved and to describe the correct approach to performing these tasks and there is an emphasis on working with data technologies via written computer languages.

[via Statistical Modeling, Causal Inference, and Social Science]

Friday, May 23, 2008

PolicyMap

PolicyMap is yet another Web site that gathers multiple sources of data and promises to make it easier to access and analyze. Unlike Numbrary and Infochimps, however, it wants to make money doing it. PolicyMap offers three levels of service: free; standard, for $2,000 a year; and premium, where prices range from $5,000 for a single user to $35,000 for 10 (the latter are the prices for governments and non-profits -- commercial users must call). Paying customers can upload their own data and have more mapping and reporting options. I haven't had time to play with it, but even the free service seems extensive, including access to "over 4,000 indicators related to housing, mortgage originations, demographics, crime, education, income, jobs, energy and taxes" and the ability to plot those datasets on thematic maps. They market themselves explicitly to media, quoting a Washingtonpost.com reporter on the site and boasting that "PolicyMap provides media professionals with quick access to reliable data in media-ready formats":

With PolicyMap, media can:

    * Generate maps and tables to incorporate in reports and articles
    * Search by address, city, state, zip code, county or census tract
    * Create topical reports by predefined region, radius, or custom-drawn region
    * Compare data across geographies or view trends over time

PolicyMap was "produced" by The Reinvestment Fund, "a progressive, results-oriented, socially responsible community investment group that today works across the Mid-Atlantic region." You can read more about PolicyMap on their blog.

[via Free Geography Tools]

Monday, May 19, 2008

Can you trust someone's conclusions if you can't reproduce their work?

In theory the strength of science is that work done in its name is reproducible and verifiable, but what does it say about the theory when in fact that's not really true?

Journals and granting agencies are prodding scientists to make their data public. Once the data is public, other scientists can verify the conclusions. Or at least that’s how it’s supposed to work. In practice, it can be extremely difficult or impossible to reproduce someone else’s results. I’m not talking here about reproducing experiments, but simply reproducing the statistical analysis of experiments.

It’s understandable that many experiments are not practical to reproduce: the replicator needs the same resources as the original experimenter, and so expensive experiments are seldom reproduced. But in principle the analysis of an experiment’s data should be repeatable by anyone with a computer. And yet this is very often not possible.

[via Statistical Modeling, Causal Inference, and Social Science]

Thursday, April 24, 2008

Datamob: "Public data put to good use"

Datamob "aims to show, in a very simple way, how public data sources are being used":

Our listings emphasize the connection between data posted by governments and public institutions and the interfaces people are building to explore that data.

It's for anyone who's ever looked at a site like MAPLight.org and wondered, "Where did they get their data?" And for anyone who ever looked at THOMAS and thought, "There's got to be a better way to organize this!"

The creators, Sean Flannagan and Lauren Sperber, say they have two broad goals:

  • Encourage governments and public institutions to make more data available in developer-friendly formats like CSV, XML and RDF. Widely accessible public data enables informed civic engagement, and we believe that providing restriction-free data to developers is the best way to promote the technological innovations that will spread knowledge.
  • Illuminate the process of creating interfaces, mashups and visualizations for public data, and inspire people to create new ones.

And this is how Sperber explains the name:

Well, the folks at Freebase coined the term "data mob" to describe a group of data-lovers working together to perfect a small portion of Freebase's ambitiously all-encompassing database. As for our Datamob, we hope it'll inspire more institutions with vast reserves of information to put their data out there in accessible formats—and bring together more data mobbers to bring that information to life.

[via]

Tuesday, April 15, 2008

Watchdog.net

... aims to "build a hub for politics on the Internet."

Our plan has three parts:

Data: There's a lot of great information out there about politics – district demographics, votes, lobbying records, campaign finance reports – but unfortunately it's split across a dozen different web sites and often hidden behind confusing interfaces. We're pulling all of that together and letting you explore it in one elegant, unified interface. (Plus, we're sharing all the results so you can come up with new ways to explore it.)

Action: Just giving you information isn't enough. Unless you can do something about it, it's just going to get you down. So we're building a series of first-class tools for getting involved – ways to write and call your representatives, send letters to local media, and figure out who to vote for.

Causes: But politics isn't about people doing things in isolation; it's about coming together around shared causes. That's why we let you start your own causes and campaigns, invite your friends to join them, and let you learn about other causes that could use your help.

The site is just getting started so there's not a lot to see yet (" ... we're building this site right before your eyes. So expect things to break, fix, appear, and disappear before your very eyes"), but it's backed by a grant from the Sunlight Network and its founder is Aaron Swartz, co-founder of Reddit and creator of theinfo.org, mentioned here previously.

Watchdog.net is soliciting help of all kinds and making its source code and data available to all.

Monday, April 7, 2008

Whitepages.com opens phone and address data

Whitepages.com is making "virtually all" of its data -- including the data used to make people, reverse phone and reverse address searches -- available to programmers for free.

A press release says Whitepages.com has data on "nearly 180 million people which equals 80 percent of the U.S. adult population." There are also 25 million work listings.

My first thought was that this could prove useful for anyone doing database-driven investigative reporting because it would make it easier to identify people named in public record databases.

But I thought otherwise after reading the terms of use, which include this:

if you implement the API on a restricted web site, you shall provide the Company with a log-in name and password that will allow the Company to access the web site

And these:

(b) you shall not retain or store any Data for any reason;

(c) you shall not aggregate or otherwise combine Data from individual queries for any reason;

Queries are also limited to 1,500 per day. The site says its data can be used to create "consumer applications, Web sites, and mashups" but it can't be used to "create applications for business end-users." It's understandable why they'd do this: Presumably they want to drive traffic to their site from mashups built with their data, but don't want to give away the store. Nevertheless, it's disappointing.

I still wanted to try it out, though, so I signed up for an API key and did a simple test using PHP and Louisville's mayor as my test subject. The way it works is you feed the search terms via a URL and it returns XML with the results.

The code looked like this:

<?php

$url = "http://api.whitepages.com/find_person/1.0/?firstname=jerry;lastname=abramson;zip=40201;api_key=YOUR_API_KEY_HERE";

$xmlstr = file_get_contents($url);

// PHP's SimpleXML apparently can't handle elements with prefixes like wp:
// as used by Whitepages.com, so we remove them from the xml
$xmlstr = str_replace('wp:', '', $xmlstr);

$xml = new SimpleXMLElement($xmlstr);

foreach ($xml->listings->listing as $listing) {
    echo "Name: ", $listing->people->person->firstname, ' ', $listing->people->person->lastname, "\n";
    echo "Business: ", $listing->business->businessname, "\n";
    echo "Phone: ", $listing->phonenumbers->phone->fullphone, "\n";
    echo "Address: ", $listing->address->fullstreet, "\n";
    echo "Latitude: ", $listing->geodata->latitude, "\n";
    echo "Longitude: ", $listing->geodata->longitude, "\n";
    echo "Last validated: ", $listing->listingmeta->lastvalidated, "\n\n-----------\n\n";
}

?>

And produced this output:

Name: Jerry Abramson
Business: Louisville Science Center
Phone: (502) 560-7141
Address: 727 W Main St
Latitude: 38.257345
Longitude: -85.761902
Last validated: 03/2006

-----------

Name: Jerry Abramson
Business: City of Louisville Metro Government
Phone: (502) 574-5000
Address: 400 S 6th St
Latitude: 38.253456
Longitude: -85.760631
Last validated: 12/2006

-----------

Name: Jerry Abramson
Business:
Phone: (502) 897-6559
Address: 44 Eastover Ct
Latitude: 38.252427
Longitude: -85.677070
Last validated: 12/2007

-----------

Name: Jerry Abramson
Business: City of Lsvl Jfrsn Cnty Plc
Phone:
Address: 768 Barret Ave
Latitude: 38.240838
Longitude: -85.731823
Last validated: 12/2004

-----------

Thus by feeding Whitepages just a name and ZIP code, we get back organizations that may be related to our subject, as well as phone numbers, addresses, latitude and longitude for mapping and a date for when the data was last checked. This example doesn't show it, but this search also turned up the name of the mayor's wife.

Nice. Too bad there are so many restrictions.
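
A footnote on the prefix problem: stripping "wp:" with str_replace works, but SimpleXML can also read prefixed elements directly if you ask for children by namespace prefix. Here is a minimal sketch of that approach. The sample XML is hypothetical: the element names mirror the code above, but the root element and the namespace URI are placeholders I made up, not the actual Whitepages.com response.

```php
<?php
// Hypothetical sample response. The element names follow the code above;
// the root element and namespace URI are placeholders, not the real API's.
$xmlstr = <<<'XML'
<wp:wp xmlns:wp="http://example.com/hypothetical-namespace">
  <wp:listings>
    <wp:listing>
      <wp:people>
        <wp:person>
          <wp:firstname>Jerry</wp:firstname>
          <wp:lastname>Abramson</wp:lastname>
        </wp:person>
      </wp:people>
    </wp:listing>
  </wp:listings>
</wp:wp>
XML;

$xml = new SimpleXMLElement($xmlstr);

// children('wp', true) selects child elements by their namespace prefix,
// so the raw XML string does not need to be edited first.
$names = [];
foreach ($xml->children('wp', true)->listings->listing as $listing) {
    $person = $listing->people->person;
    $names[] = $person->firstname . ' ' . $person->lastname;
}

echo implode("\n", $names), "\n";
```

The advantage over str_replace is that the text inside the elements is never touched, so a listing that happened to contain the string "wp:" would come through intact.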

[via]

Friday, March 21, 2008

Infochimps.org: "Free Redistributable Rich Data Sets"

"infochimps.org is a community to assemble and interconnect a giant free almanac, with tables on everything you can put in a table—things like a century of hourly weather, every major league baseball game, decades of stock prices, or every US patent filing":

Exploring rich data is fun, but finding it, formatting it, tagging it with metadata is drudge work barely fit for a trained chimp. And if you want to share a large raw dataset online, you face two troubling prospects: a) that no one will find it, or b) that everyone will find it.

A central, community-driven repository solves these problems, and also presents amazing possibilities. Interconnect the datasets along concepts they share: instead of 100,000 datasets, there’s just one. Study the physics of baseball by comparing the hourly weather during every single baseball game to game outcomes. Uncover political campaign irregularities by comparing neighborhood per-capita income, historical voter trends, and public campaign finance records. Plan real-estate decisions based on what news-and-other-media keywords rank highly in each area. If you’ve read Freakonomics, you know the power of this approach—let’s start building tools that make this way of thinking available to everychimp.

This is more than a little reminiscent of the Numbrary, mentioned here a few weeks ago.

[via]

Friday, March 14, 2008

Just how useful to terrorists is geographic data on the Web?

You may recall the wholesale pulling of information from the Web after 9/11. It was a remarkable example of institutional fear, simple-mindedness and the politically-sensitive bureaucrat's instinct for cosmetic solutions over meaningful ones. The RAND Corporation took an in-depth look at the dangers of putting geographic data on the Web and found that it posed little risk at all. The 2004 report, which I just came across, is called "Mapping the Risks: Assessing the Homeland Security Implications of Publicly Available Geospatial Information" (PDF):

  • Our analysis found that very few of the publicly accessible federal geospatial sources appear useful to meeting a potential attacker’s information needs. Fewer than 6 percent of the 629 federal geospatial information datasets we examined appeared as though they could be useful to a potential attacker. Further, we found no publicly available federal geospatial datasets that we considered critical to meeting the attacker’s information needs (i.e., those that the attacker could not perform the attack without).
  • Our analysis suggests that most publicly accessible federal geospatial information is unlikely to provide significant (i.e., useful and unique) information for satisfying attackers’ information needs. Fewer than 1 percent of the 629 federal datasets we examined appeared both potentially useful and unique. Moreover, since the September 11 attacks, these information sources are no longer being made public by federal agencies. However, we cannot conclude that publicly accessible federal geospatial information provides no special benefit to the attacker. Neither can we conclude that it would benefit the attacker. Our sample suggests that the information, if it exists, is not distributed widely and may be scarce.
  • In many cases, diverse alternative geospatial and nongeospatial information sources exist for meeting the information needs of potential attackers. In our sampling of more than 300 publicly available nonfederal geospatial information alternative sources, we found that the same, similar, or more useful geospatial information on U.S. critical sites is available from a diverse set of nonfederal sources. These sources include industry and commercial businesses, academic institutions, NGOs, state and local governments, international sources, and even private citizens who publish relevant materials on the World Wide Web. Some geospatial data and information that these nonfederal sources distribute are derived from federal sources that are publicly accessible. Similarly, these nonfederal organizations are increasingly becoming sources of geospatial data and information for various federal agencies (see Chapter Three for additional discussion). In addition, relevant information is often obtainable via direct access or direct observation of the U.S. critical site.

Incidentally, appendix B of the report gives a very comprehensive list of federal geospatial data sources on the Web, including the URLs. Just don't tell bin Laden.

    (via The FOI Advocate)

    Monday, March 10, 2008

    Congressional pay rates, 1789-2008

    Via beSpacific, a Congressional Research Service report (PDF) giving Congressional pay rates since 1789. But there's no attempt to adjust for inflation, so you can't make any judgments about how well compensated legislators are now versus then. It also explains how Congressional salaries are set. The current "payable salary" from the report: $169,300. Good work if you can get it.
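
If you wanted to do the inflation adjustment yourself, the arithmetic is simple: multiply the old salary by the ratio of a price index in the target year to the same index in the original year. A rough sketch follows. The function name is my own, and the salary and index values are hypothetical placeholders chosen to make the math easy to follow, not figures from the CRS report or from any real price index.

```php
<?php
// Convert a historical dollar amount into target-year dollars using a
// price index: adjusted = nominal * (index now / index then).
function adjustForInflation(float $nominal, float $indexThen, float $indexNow): float
{
    if ($indexThen <= 0) {
        throw new InvalidArgumentException('Index for the base year must be positive.');
    }
    return $nominal * ($indexNow / $indexThen);
}

// Hypothetical placeholder values, NOT real salary or price-index figures.
$oldSalary = 10000.0;
$indexThen = 50.0;
$indexNow  = 200.0;

$adjusted = adjustForInflation($oldSalary, $indexThen, $indexNow);
echo number_format($adjusted, 2), "\n"; // prints 40,000.00
```

To compare eras for real, you would substitute actual salaries from the report and a published price index covering both years.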

    Thursday, March 6, 2008

    UNdata

    ... has country-by-country data on agriculture, education, employment, energy, environment, industry, economics, population, trade and tourism. The site, by the United Nations Statistics Division, says it has more than 55 million records and will be adding more.

    Tuesday, March 4, 2008

    Edward Tufte's Ask E.T. forum

    Edward Tufte, "the Galileo of graphics," as a Business Week quote displayed prominently on Tufte's Web site labels him, regularly answers questions about information design. Current topics include "Sparklines: theory and practice," "Graphing Software," "Corrupt Techniques in Evidence Presentations" and "Mapping election results."

    Saturday, March 1, 2008

    Numbrary

    The goal of the Numbrary is to be an online library for numbers. From its about page:

    It's hard to locate good numbers about anything.

    Here, try it yourself: How much money have the top 3 pharmaceutical firms spent on research & development over the last 5 years?

    Bet that took you a while. But these numbers are all in the public record — why should it take more than a few seconds to answer the question?

    You can contribute data to the site, suggest data for it to acquire and track updates via its blog.

    Tuesday, February 26, 2008

    Saving the American Time Use Survey

    A group is soliciting signatures to save the American Time Use Survey from being cut from the federal budget. They say the survey is "the most important new data initiative begun by the U.S government in at least 35 years":

    The ATUS provides essential information on how Americans spend their time, including time spent caring for children, cleaning the house, working for pay, and caring for sick adults. Put simply, the ATUS is needed to expand our horizons beyond merely charting where dollars go, to charting where time goes too. Statistics on spending, jobs, incomes, and so on are undeniably important. But anyone who wants to understand the changing lives of American families, to monitor the well-being of the American population, to measure national output, productivity and other outcomes that are essential to sound economic policy-making, or to make informed social policy decisions also needs information on how our population spends its time.

    Monday, February 25, 2008

    Open Government Data

    Every journalist should read the Open Government Data Principles at the Open Government Data wiki. The principles were developed at a meeting of 30 open government advocates in October:

    Government data shall be considered open if it is made public in a way that complies with the principles below:

    1. Complete
    All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
    2. Primary
    Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
    3. Timely
    Data is made available as quickly as necessary to preserve the value of the data.
    4. Accessible
    Data is available to the widest range of users for the widest range of purposes.
    5. Machine processable
    Data is reasonably structured to allow automated processing.
    6. Non-discriminatory
    Data is available to anyone, with no requirement of registration.
    7. Non-proprietary
    Data is available in a format over which no entity has exclusive control.
    8. License-free
    Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.

    Compliance must be reviewable.

    And if that interests you, you should sign up for the Open Government mailing list. Its members include Carl Malamud, whose most recent project is putting all federal court documents online, and Aaron Swartz, who founded theinfo.org, a Web site devoted to dealing with large data sets. theinfo, mentioned here previously, also has a series of mailing lists, for getting, processing and viewing corpulent data.

    Saturday, February 23, 2008

    Derby DataTrack and Many Eyes: 2008 Kentucky Derby contenders, trainers and sires

    This weekend we released the latest version of Derby DataTrack, our database of potential contenders for the Kentucky Derby. I know we (read: I) can do a better job presenting this data, but I haven't yet figured out how. A while ago ManyEyes added a network visualization tool and a way of embedding their visualizations on any Web site, so I thought I'd give it a try:

    While this is intriguing, this isn't the solution, so if you have any thoughts on how we can do better that don't involve mastering Flash or Processing in a week, drop me a note.

    Friday, February 22, 2008

    The New York Times and "Playgrounds for Data"

    Cyberjournalist.net pointed me to this article by Jared M. Spool at User Interface Engineering, who says the New York Times is "a leader in creating interactive modules to accompany their news stories, often yielding in an impressive and fun experience":

    Every organization sits on a ton of data. Making that data useful is a constant challenge for designers.

    By looking at what the NYTimes interactive team has done, often with very small time frames, we can see examples of what is possible. From their work, we can learn new ways of presenting complex information in fun and engaging ways.

    Spool also wrote about "Playful Data: 3 Inspiring Interactive Web Sites":

    It feels to us that we're just at the beginning of what will likely be a revolution in how we handle large data sets. The applications we're seeing now are just the tip of the iceberg. The real value will be when we see these types of playful data tools in almost every application we touch. For those of us who like to play with our data, we're about to have some real fun.

    Friday, February 15, 2008

    SchoolDataDirect.org

    The sponsors of SchoolDataDirect include the Bill & Melinda Gates Foundation and the Council of Chief State School Officers. The site lets you browse and download a wealth of education data -- including test scores, finances and demographics down to individual schools -- and promises to always have the most current data available. You can also compare schools to each other -- not just on blunt measures like overall test scores, but also how well they perform with special needs kids, English language learners and kids with disabilities. Disappointing to me, however, is that it appears to prohibit newspapers from using the data en masse:

    If you are not associated with an academic institution or nonprofit organization you may only reproduce, distribute, display, or transmit de minimus amounts of Education Data on an infrequent basis and only for noncommercial purposes.

    Thursday, January 24, 2008

    EveryBlock, Heath Ledger and the Pothole Paradox

    EveryBlock launched yesterday.

    Only eight months after winning a $1.1 million grant in the Knight News Challenge, the non-profit site has gone live with detailed data for San Francisco, Chicago and New York. It promises to expand to more places in the future.

    EveryBlock gathers freely available data -- building permits, crime reports, new building permits, blog posts, restaurant inspections, news articles, Flickr photos, Yelp business reviews, missed connections from Craigslist -- and makes it easy to search and browse by neighborhood. As with everything associated with founder Adrian Holovaty, it is artfully done. It sets a standard all data-driven sites should aspire to.

    "Sigh, if only newspaper sites were as well organized as this…" Journalistopia said.

    Many, including Al's Morning Meeting, hail it as "the beginning of something big."

    I'm not so sure.

    For one, there are others plowing similar ground. There's outside.in, YourStreet, and Yahoo!'s Our City (which exists only in India, for the moment at least). Everyone wants to be local these days, including Google, where a search for pizza 40205 will get you a map, address, phone numbers, reviews, a menu and more.

    None of those sites, which have their own strengths, offer the rich public record data being mined by EveryBlock. But EveryBlock also leaves me a little … cold.

    It is data without context, perspective or meaning. A comment on MetaFilter put it this way:

    This is a great idea but it certainly won't replace local news coverage because there's no way to figure out what the politics of anything are.

    For example, lots of politics around building permits, liquor licenses, development, etc.: no way to know what any of the granular stuff really means. it's nice to know that a restaurant on my block applied for a liquor license: maybe I can go and stop them because I'm afraid of noise.

    but how do I know what the backstory is? how do I know who is who? how do I find anything interesting? how do I know if I'm having an impact?

    it's a great tool for an actual local reporter to find info needed for stories-- but it doesn't do the valuable thing that reporters do when they are working well, which is boil down all the boring shit and give you what you need to know when you need to know it.

    Here's another MetaFilter comment:

    Much of the other data has too much noise. That a local restaurant has just received a scheduled inspection is too low-level. I dont care. I do want to know maybe if a local restaurant has received an extremely bad report. So maybe the data needs to be filtered.

    And while Journalistopia found it praiseworthy, it added:

    It’s tough to put all of that data into context and provide more historical information such as a community’s history, landmarks and evolving story. For instance, having a highly detailed view of crimes in a neighborhood is really cool, but how does my neighborhood compare to another? How is crime in the neighborhood trending?

    How indeed?

    Holovaty has described conventional news as a "blob of text." "Newspapers need to stop the story-centric worldview," he wrote in 2006.

    Stop? I don't think so. Go beyond, maybe. I'm ostensibly a data person, but browsing raw data, while it can be worthwhile, isn't nearly as compelling to me as the sudden, unexplained death of a 28-year-old movie star.

    Holovaty's absolutely right that news organizations -- including my own -- haven't even begun to exploit the potential of structured data on the Web. But there's no evidence in the past, no evidence now, nor will there be any evidence in the future, that the way ahead for the news industry is to feed the world more raw data, however skillfully deployed.

    We've got too much of the stuff already. What we need is for it to be boiled down. Distilled. Made interesting.

    EveryBlock says that's its goal. It says it exists to answer the question: "What's happening in my neighborhood."

    For a long time, that's been a tough question to answer. In dense, bustling cities like Chicago, New York and San Francisco, the number of daily media reports, government proceedings and local Internet conversations is staggering. Every day, a wealth of local information is created -- officials inspect restaurants, journalists cover fires and Web users post photographs -- but who has time to sort through all of that?

    Our mission at EveryBlock is to solve that problem. We aim to collect all of the news and civic goings-on that have happened recently in your city, and make it simple for you to keep track of news in particular areas. We're a geographic filter -- a "news feed" for your neighborhood, or, yes, even your block.

    Just how compelling will its offerings be to most readers?

    Steven Johnson, a writer and one of the principals behind outside.in, called it the Pothole Paradox.

    The Pothole Paradox goes like this:

    1. Say you've got a particularly nasty pothole on your street that you've been scraping the undercarriage of your car against for a year. When the town or city finally decides to fix the pothole, that event is genuinely news in your world. And it is news that you'll never get from your local paper, or TV affiliate, or radio station.

    Obviously this is a great opportunity for a site like outside.in, where news of pothole repairs might easily trickle up from neighborhood bloggers. But it's not that simple, alas -- there's a flip side to the pothole paradox:

    2. News about a pothole repair just five blocks from your street is the least interesting thing you could possibly imagine.

    Johnson added:

    The other complication here is that the correct scale of hyperlocal news varies depending on the nature of the news itself. Pothole repair may die out beyond a few blocks, but many happenings -- crimes or political rallies or controversial real estate development -- reverberate more widely. Going local sometimes requires that you zoom in all the way to the block level, even all the way to the individual address. But sometimes you need to zoom out too.

    EveryBlock promises to keep adding new features, so we don't know what it will eventually become. But I don't see it appealing to the masses the way it is now. Knowing that there was a construction violation ("34627269N") issued for 35 East 32 Street on December 27, 2007 isn't likely to be interesting even to the people living next door at 37 East 32 Street.

    FAILURE TO POST DOT PERMIT FOR PLACING MATERIAL ON STREET.AT TIME OF INSPECTION SKIDS OF CMU "CONCRETE MASONRY UNITS" ARE STORED AT ROAD INFRONT OF 33 E 32 STREET.THE GC HAS STORED THIS MATERIAL ON THE STREET

    Gotcha. But just between you and me, did Heath Ledger live nearby?

    Friday, January 18, 2008

    Free neighborhood boundary map files from Zillow (with some strings attached)

    You can download them here:

    The Zillow data team has created a database of nearly 7,000 neighborhood boundaries in the largest cities in the U.S. And we'd like to share them with you! We're sharing these neighborhoods under a Creative Commons license to allow people to use and contribute to our growing database.

    Now comes the fine print: You are free to use the files in this database in applications as long as you attribute Zillow when you use it. You may also make your own changes to the database files and distribute them, as long as you provide them under the same kind of license and give Zillow attribution. The neighborhood shapes are available below, zipped up in the Arc Shapefile format.

    Free Geography Tools notes that coverage is still limited, but Zillow is encouraging contributions and will incorporate them in their files if they prove accurate.

    Official Statistics on the Web

    Official Statistics on the Web, or OFFSTATS, from the University of Auckland Library, points you to free statistics from official sources online. Here's the section for the United States and here's Wallis & Futuna. You can search by country, region or topic. The site notes that it points to current data that is often downloadable as text or spreadsheet files.