Saturday, March 29, 2008

A cure for Web flatulence

It's lovely that so many people have so many wonderful ideas for what newspapers should do to save themselves, but I'm tired of reading about them. Why am I tired? Because the people who write about these almost never offer the information you need to evaluate their true worth. Newspaper print ad revenues plunged farther last year than in any of the 50+ years since such measurements began. That's why people are being laid off. That's why investigative reporting teams are being shut down. That's why almost no one wants to bid when a newspaper goes on sale. The numbers are bad, really bad.

So if you're going to tell us about your great online project and how it's going to help reverse this trend, you've got to give us some numbers too. You've got to give us the information we need to fairly evaluate it -- as a business proposition. You need to give us something more solid than Web flatulence to decide whether your idea is something the news industry can build profitable businesses around.

We're told we need to build data centers. We're told we need to build narrowly targeted Web sites serving niche markets. We're told we need to go hyperlocal. We're told we need to crowdsource. We're told we need to deploy mobile journalists. We're told we need to nurture citizen journalists. We're told we need to engage readers in conversations. We're told we need to become link aggregators. We're told we need to do podcasts. We're told we need to do video. We're told we need to provide feeds for everything we do. We're told we need to spew text messages and tweets, and build widgets on Facebook. We're told we need to do continuous updates online, 24/7.

Fine. Those are all good ideas. But if you've done it, what were the results? Show us the numbers. Give us a fair and honest evaluation of how you did against your competition, however defined.

These are the kinds of questions I want answered for all online news projects, large and small:

  • How many page views did your project generate? How many unique visitors? How long did they stay on the site? Where did they come from? Are they coming back?
  • How does that compare with other things you've done?
  • Do you have any advertisers for this? Who are they? How much are they paying? Is it generating any other revenue? How much? If it isn't generating any money, why not?
  • How much did it cost to make this? How long did it take? How many people were involved? What didn't you do in the meantime?
  • Did you make a profit? Did you even try to measure whether it's profitable? How do you evaluate whether it's successful?
  • Is it easily repeatable? In other words, is it a strategy that can be adopted by any news organization, at any time, or does it require unique, hard-to-find skills? Can you keep it going if the creator quits?
  • Who else is doing this? How successful are they? Do they do it better than you? How easy is it for competitors to duplicate what you've done?
  • What mistakes did you make? What didn't work and why? What would you do differently next time?

Of course most of us won't answer most of these questions publicly, either because our employers won't let us, or because we don't know the answer, or because it's not our department, or because it's embarrassing, or because we just want to do what we do because we can and it's cool and it's fun. I get that.

But that's what I want to know.

Friday, March 28, 2008

PDFTextOnline

PDFTextOnline offers "Hassle-Free PDF Text Extraction in Your Browser":

Getting text and other content out of your PDF documents is often a hassle. Adobe Acrobat™ (or your other favorite PDF viewer) can do copy-and-paste, but that's time-consuming and tedious for anything but the smallest jobs. Acrobat™ also has a 'save as text' option, but unless you spring for Acrobat™ Professional, it often generates inaccurate text and simply cannot cope with some languages (especially Chinese, Japanese, and Korean).

Your other options include Adobe's online text conversion tools (which make you wait for an email to get the converted PDF content), or one of the dozens of utilities swarming around the Internet that require you to download, install, and then hope that they won't spray viruses around your computer.

I gave it a try on some PDFs and it was impressively fast and converted the PDFs to text cleanly. But unfortunately, it still isn't helpful enough with the PDFs that truly vex me, like this one from our court system. Its neat and orderly tables of data look like they would be easy to convert to text and import into a spreadsheet, but in fact doing so is an incredible PITA because those neat and orderly tables collapse into a difficult-to-parse jumble when converted to text. Usually I resort to begging the courts to give it to me in Excel (which can take days, if they'll agree to do it at all) or using Perl and regular expressions. The PDFTextOnline text was very clean, and appeared as good as, if not better than, the output of other conversion tools I've tried, but it still would require work to put into Excel or a database for analysis.
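For anyone facing the same chore, here is a minimal sketch of the regular-expression approach, written in Python rather than Perl. The sample text, the column names and the rule that two or more spaces separate columns are all assumptions made for illustration; real court output may need fixed column positions instead.

```python
import csv
import io
import re

# Hypothetical sample of what a court table can look like after PDF-to-text
# conversion: columns are separated by runs of spaces, not real delimiters.
SAMPLE_TEXT = """\
Case No.      Defendant         Filed        Charge
08-CR-0001    SMITH, JOHN A     01/02/2008   THEFT BY UNLAWFUL TAKING
08-CR-0002    DOE, JANE         01/03/2008   BURGLARY 2ND DEGREE
"""

# Assumption: two or more consecutive spaces mark a column boundary, while
# single spaces belong inside a value (e.g. "SMITH, JOHN A").
COLUMN_GAP = re.compile(r" {2,}")


def text_table_to_rows(text):
    """Split each non-blank line into cells on runs of 2+ spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text that Excel or a database can import."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = text_table_to_rows(SAMPLE_TEXT)
print(rows_to_csv(rows))
```

This breaks down when a cell is empty or a long value wraps onto a second line, which is exactly where the jumble gets hard to parse; those cases usually need hand-tuned rules per document.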

[via NICAR-L and Neil Reisner]

Monday, March 24, 2008

Mass layoff data

The Bureau of Labor Statistics tracks mass layoffs. This includes data for individual states.

[via beSpacific]

Chauncey Bailey Project

Journalists who have banded together to finish the work of an Oakland journalist murdered last year have a Web site:

New America Media and the Maynard Institute have convened an array of Bay Area journalists, as well as highly respected media organizations and local university journalism departments to form an investigative team to honor and continue the work of journalist Chauncey Wendell Bailey Jr., and answer questions regarding his death. Bailey, the editor of the weekly Oakland Post, was murdered on Aug. 2 while reporting on a story regarding the suspicious activities of the Your Black Muslim Bakery.

In an unusual collaboration, more than two dozen reporters, photographers and editors from print, broadcast and electronic media, and journalism students are launching the Chauncey Bailey Project - an investigative unit that will continue and expand on the reporting Bailey was pursuing when he was gunned down. Devaughndre Broussard, 19, a handyman for Your Black Muslim Bakery, has confessed to the crime, according to police, but many questions about the possible motive for the killing have yet to be answered.

The most famous such effort was the "Arizona Project" on behalf of Don Bolles, a journalist murdered in 1976 while reporting on organized crime.

Saturday, March 22, 2008

Microsoft Excel as a 3D game engine

This is either an advanced form of psychosis or genius: This guy demonstrates how to use Excel to make animated 3D graphics:

Integration of computer games and spreadsheets has tightened during the evolution of computer technology.

At an early stage this integration among the games and spreadsheets was comical, e.g. they were installed on the same hard disk, or the purchased games were listed in an Excel sheet. Later the integration has tightened, as some games introduced a built-in spreadsheet (accessible by the "boss key" feature) - or Excel contained some built-in 3D games as Easter Eggs.

Now we have arrived at the next step of this integration, as Excel's cutting-edge 3D functionality is not hidden in Easter Eggs anymore but can be accessible publicly and easily. Excel has grown up and started its conquest as a revolutionary 3D game engine.

(When he talks about Easter Eggs, by the way, he's not talking about what the bunny will pass out Sunday. He's talking about secret features hidden in software.)

[via]

Friday, March 21, 2008

Infochimps.org: "Free Redistributable Rich Data Sets"

"infochimps.org is a community to assemble and interconnect a giant free almanac, with tables on everything you can put in a table—things like a century of hourly weather, every major league baseball game, decades of stock prices, or every US patent filing":

Exploring rich data is fun, but finding it, formatting it, tagging it with metadata is drudge work barely fit for a trained chimp. And if you want to share a large raw dataset online, you face two troubling prospects: a) that no one will find it, or b) that everyone will find it.

A central, community-driven repository solves these problems, and also presents amazing possibilities. Interconnect the datasets along concepts they share: instead of 100,000 datasets, there’s just one. Study the physics of baseball by comparing the hourly weather during every single baseball game to game outcomes. Uncover political campaign irregularities by comparing neighborhood per-capita income, historical voter trends, and public campaign finance records. Plan real-estate decisions based on what news-and-other-media keywords rank highly in each area. If you’ve read Freakonomics, you know the power of this approach—let’s start building tools that make this way of thinking available to everychimp.

This is more than a little reminiscent of the Numbrary, mentioned here a few weeks ago.

[via]

Document Contrast Diagrams

Neoformix explains a Document Contrast Diagram comparing the president's 2007 and 2008 State of the Union speeches:

A Document Contrast Diagram is a visual summary of the content of two text documents that illustrates shared words, words that are unique to one document or the other, word frequency, relative size of the two documents, distribution of emotional tone within the documents, related words based on co-occurrence, and the most common word in each document segment.

No explanation of how to create such a thing, though.
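The counting underneath such a diagram is easy enough to approximate, even if the drawing is not. What follows is a rough sketch, not Neoformix's method: it finds the words two documents share and the words unique to each, with their frequencies. The two sample sentences and the tiny stop-word list are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins for two speeches; real input would be full transcripts.
DOC_A = "The economy is strong and the economy is growing."
DOC_B = "The war is long and the economy is uncertain."

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "is", "and", "a", "of", "to"}


def word_counts(text):
    """Lowercase the text, pull out words, and count the non-stop-words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def contrast(text_a, text_b):
    """Return words shared by both documents and words unique to each."""
    a, b = word_counts(text_a), word_counts(text_b)
    shared = {w: (a[w], b[w]) for w in a.keys() & b.keys()}
    only_a = {w: a[w] for w in a.keys() - b.keys()}
    only_b = {w: b[w] for w in b.keys() - a.keys()}
    return shared, only_a, only_b


shared, only_a, only_b = contrast(DOC_A, DOC_B)
print("shared:", shared)
print("only in A:", only_a)
print("only in B:", only_b)
```

Emotional tone, co-occurrence and the layout itself are left out here; each would need its own word lists or a plotting library.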

New York Public Radio on deception

New York Public Radio devoted a recent broadcast to deception.

We look at lies, liars, and lie catchers, and ask: can you lead a life without deception? We consult a cast of characters, from pathological liars to lying snakes to drunken psychiatrists, to try and understand the dark trait of deception.

You can download the entire episode as an MP3. I, of course, have never lied, so the broadcast is of little relevance for anyone who wants to understand me.

[via]

Thursday, March 20, 2008

First Lady Hillary Rodham Clinton's Daily Schedules

... are now available online. As The New York Times reports, they were made available in response to a Freedom of Information request and a lawsuit. It will be interesting to see what the blathersphere makes of these. Unfortunately, not only are they PDFs, which makes extracting useful information difficult, but 4,746 of the pages have been censored. Says the Times:

The dry records carry all the emotional punch of a factory worker’s time card, showing where she was for much of her eight years in the White House but telling nothing about what she was saying, thinking or doing.

[via]

Wednesday, March 19, 2008

Tuesday, March 18, 2008

People finder white paper

The indefatigable Marcus P. Zillman has updated his white paper on people finder resources. You can download it as a 21-page PDF.

The Public Record Research TIPS BOOK

... from Facts on Demand Press is $19.95 and is scheduled to be published at the end of this month:

Learn first hand how to use “Insider Information” for searching for public records at thousands of government public record agencies and web pages. This resource provides the tips and practical knowledge to guide you to the right source and help you become an ultra-efficient searcher.

  • Field Guide to Court Record Databases
  • How to Evaluate Public Record Vendor
  • How to Evaluate Record Search Sites
  • Why and When All Criminal Records are Not Created Equal
  • Expanded Coverage of Government Watch Lists, Sanction Lists, and Enforcement Actions
  • Know the Location Anomalies that Affect Searching Liens and Assets

Monday, March 17, 2008

Raising Franken-measures from the dead

Juice Analytics explains "Franken-measures", "a made-up metric monster that creates a comprehensive composite to capture complex concepts":

Franken-measures go by many names — indexes, scales, ratings, composite or compound measures — and show up in all sorts of places:

The Courier-Journal's Litkenhous Ratings (which I have no part in) are a kind of Franken-measure.
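To make the idea concrete, here is one generic recipe for a composite measure: standardize each component, then take a weighted sum. This is only a sketch under stated assumptions. It is not the Litkenhous formula or anything Juice Analytics published, and the teams, numbers and weights are invented.

```python
from statistics import mean, pstdev

# Hypothetical teams and made-up numbers, purely for illustration.
TEAMS = {
    "Team A": {"win_pct": 0.80, "point_margin": 12.0},
    "Team B": {"win_pct": 0.50, "point_margin": 0.0},
    "Team C": {"win_pct": 0.20, "point_margin": -12.0},
}

# Assumed weights; choosing them is exactly the judgment call that makes
# composite measures debatable.
WEIGHTS = {"win_pct": 0.6, "point_margin": 0.4}


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def composite_index(teams, weights):
    """Weighted sum of standardized components, one score per team."""
    names = list(teams)
    scores = dict.fromkeys(names, 0.0)
    for metric, weight in weights.items():
        standardized = z_scores([teams[n][metric] for n in names])
        for name, z in zip(names, standardized):
            scores[name] += weight * z
    return scores


scores = composite_index(TEAMS, WEIGHTS)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:+.2f}")
```

Change the weights and the scores shift, which is a good reason to ask how any published index was assembled. Note that the sketch would divide by zero if every team had the same value for a component.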

Text messages are increasingly being treated as public records

... reports USA Today:

Those supposedly private messages that public officials dash off on their government cellphones to friends and colleagues aren't necessarily private after all.

Courts, lawyers and states are increasingly treating these typed text messages as public documents subject to the same disclosure laws — including the federal Freedom of Information Act — that apply to e-mails and paper records.

"I don't care if it's delivered by carrier pigeon, it's a record," said Charles Davis, executive director of the National Freedom of Information Coalition at the University of Missouri. "If you're using public time or your public office, you're creating public records every time you hit send."

Friday, March 14, 2008

Just how useful to terrorists is geographic data on the Web?

You may recall the wholesale pulling of information from the Web after 9/11. It was a remarkable example of institutional fear, simple-mindedness and the politically-sensitive bureaucrat's instinct for cosmetic solutions over meaningful ones. The RAND Corporation took an in-depth look at the dangers of putting geographic data on the Web and found that it posed little risk at all. The 2004 report, which I just came across, is called "Mapping the Risks: Assessing the Homeland Security Implications of Publicly Available Geospatial Information" (PDF):

  • Our analysis found that very few of the publicly accessible federal geospatial sources appear useful to meeting a potential attacker’s information needs. Fewer than 6 percent of the 629 federal geospatial information datasets we examined appeared as though they could be useful to a potential attacker. Further, we found no publicly available federal geospatial datasets that we considered critical to meeting the attacker’s information needs (i.e., those that the attacker could not perform the attack without).
  • Our analysis suggests that most publicly accessible federal geospatial information is unlikely to provide significant (i.e., useful and unique) information for satisfying attackers’ information needs. Fewer than 1 percent of the 629 federal datasets we examined appeared both potentially useful and unique. Moreover, since the September 11 attacks, these information sources are no longer being made public by federal agencies. However, we cannot conclude that publicly accessible federal geospatial information provides no special benefit to the attacker. Neither can we conclude that it would benefit the attacker. Our sample suggests that the information, if it exists, is not distributed widely and may be scarce.
  • In many cases, diverse alternative geospatial and nongeospatial information sources exist for meeting the information needs of potential attackers. In our sampling of more than 300 publicly available nonfederal geospatial information alternative sources, we found that the same, similar, or more useful geospatial information on U.S. critical sites is available from a diverse set of nonfederal sources. These sources include industry and commercial businesses, academic institutions, NGOs, state and local governments, international sources, and even private citizens who publish relevant materials on the World Wide Web. Some geospatial data and information that these nonfederal sources distribute are derived from federal sources that are publicly accessible. Similarly, these nonfederal organizations are increasingly becoming sources of geospatial data and information for various federal agencies (see Chapter Three for additional discussion). In addition, relevant information is often obtainable via direct access or direct observation of the U.S. critical site.

Incidentally, appendix B of the report gives a very comprehensive list of federal geospatial data sources on the Web, including the URLs. Just don't tell bin Laden.

(via The FOI Advocate)

    Thursday, March 13, 2008

    "Agencies run more than a decade late on Freedom of Information requests"

    So reports The Hill:

    The Energy Department has the tardiest public record request, according to a review by The Hill of annual FOIA reports published by Cabinet-level agencies for the last fiscal year. It still has not answered one request from Dec. 6, 1991, although other departments are not far behind.

    The Defense Department has a request pending from May 5, 1992, while the Treasury Department has not answered a request from March 8, 1993.

    (via Michael Ravnitzky on FOI-L)

    Wednesday, March 12, 2008

    HTML Reference

    Do most journalists need to learn HTML? I'm not convinced they do, much less JavaScript. Nevertheless, it wouldn't hurt -- and Sitepoint has released a new HTML reference; they say they've "worked hard to make this the most detailed and up-to-date reference on the subject available." They already have one for CSS -- and promise one for JavaScript, in case I'm wrong and you're ruining your career by not mastering this stuff. (via)

    Monday, March 10, 2008

    Emergency management bibliography

    Here's a guarantee if you're a news reporter: There will be a disaster (probably several) in your coverage area at some time in your career and you will be called upon to evaluate how well the emergency responders responded. Save this 750-page bibliography (PDF) to help you in your research when that time comes. (via Resourceshelf, which calls the bibliography "Awesome" and explains why many entries are highlighted in yellow)

    Congressional pay rates, 1789-2008

    Via beSpacific, a Congressional Research Service report (PDF) giving Congressional pay rates since 1789. But there's no attempt to adjust for inflation, so you can't make any judgments about how well compensated legislators are now versus then. It also explains how Congressional salaries are set. The current "payable salary" from the report: $169,300. Good work if you can get it.
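Doing the inflation adjustment yourself is simple arithmetic once you have a price index covering both years: multiply the old amount by the ratio of the current index to the old one. A minimal sketch follows; the salary and index values are placeholders, not figures from the report or from any real price series.

```python
def adjust_for_inflation(nominal, index_then, index_now):
    """Convert a past dollar amount into current dollars via a price index."""
    if index_then <= 0:
        raise ValueError("index_then must be positive")
    return nominal * (index_now / index_then)


# Placeholder figures, NOT real salary or CPI data: substitute a salary from
# the CRS report and index values from a published price series.
past_salary = 10_000.00
index_then = 25.0
index_now = 200.0

in_todays_dollars = adjust_for_inflation(past_salary, index_then, index_now)
print(f"${in_todays_dollars:,.2f}")
```

The hard part is finding a consistent index: the further back you go toward 1789, the more the available price series are estimates, so treat early comparisons as rough.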

    Thursday, March 6, 2008

    White Collar Crime Prof Blog: How Not to Ask Questions

    Asking clear and direct questions is as important for journalists as it is for prosecutors. The White Collar Crime Prof Blog points out the inadequate questioning revealed in the recently unsealed Barry Bonds grand jury testimony. Many a hemming and hawing journalist can relate to the poor prosecutors criticized here:

    While the indictment presents Bonds in a bad light by isolating specific instances of allegedly false answers, skimming through the full transcript shows just how disorganized the prosecutors seemed to be, and how at least one of them couldn't ask a simple question. Whether it was nervousness or perhaps being intimidated by Bonds, the questions come across almost like a stream of consciousness approach to the examination. Here's just one example of the kind of questions Bonds faced: "Let me ask the same question about Greg at this point, we'll go into this in a bit more detail, but did you ever get anything else from Greg besides advice or tips on your weight lifting and also the vitamins and the proteins that you already referenced?" (Pg. 23) Huh? Understanding that a transcript does not necessarily convey the full flavor of the actual interchanges, in reading through the questioning I'm struck by how convoluted the questions are, punctuated throughout with "I mean," "you know," and similar distracting phrases.

    What makes perjury so difficult to prove is that the allegedly false answer is not necessarily the most important thing. As the Supreme Court noted in Bronston v. United States, 409 U.S. 352 (1973), "Precise questioning is imperative as a predicate for the offense of perjury." Among the questions recited in the original indictment was this model of obfuscatory inquiry: "So, I guess I got to ask the question again, I mean, did you take steroids? And specifically this test the [sic] is in November 2000. So I'm going to ask you in the weeks and months leading up to November 2000 were you taking steroids . . . or anything like that?"

    If the answer is important enough, you should always ask the same question in multiple ways. You don't have to look hard to find examples where prevaricating politicians -- from the non-denial denials of John Mitchell during the Watergate era to Bill Clinton parsing the meaning of "is" -- stopped short of lying but failed to give fully honest answers. You have to pin those squirming insects to the board.

    UNdata

    ... has country-by-country data on agriculture, education, employment, energy, environment, industry, economics, population, trade and tourism. The site, by the United Nations Statistics Division, says it has more than 55 million records and will be adding more.

    Wednesday, March 5, 2008

    Omnibiography.com: A directory of biographies

    Omnibiography.com links to biographies on other sites, such as Wikipedia, the Rock and Roll Hall of Fame and the National Museum of Women in the Arts. They claim to be the most complete such directory on the Internet, with information on more than 100,000 people. I didn't find anything that indicates they vet biographies they link to for accuracy or completeness, so as always, be cautious with what you find.

    Tuesday, March 4, 2008

    Firing Line Television Program Collection

    The Hoover Institution maintains a database of 1,504 episodes of the television show Firing Line, which was moderated by William F. Buckley Jr., who died last week. You can view some clips from the TV show online. The first link I clicked on randomly turned out to be a 1969 video of Buckley chatting with Billy Graham. Buckley slyly introduces Graham by saying that Graham "believes that Richard Nixon is an act of God." (Graham, not incidentally, apologized a few years ago after recordings were released of anti-Semitic remarks he made to Nixon.) Fascinating stuff, whatever your politics.

    Edward Tufte's Ask E.T. forum

    Edward Tufte, "the Galileo of graphics," as a Business Week quote displayed prominently on Tufte's Web site labels him, regularly answers questions about information design. Current topics include "Sparklines: theory and practice," "Graphing Software," "Corrupt Techniques in Evidence Presentations" and "Mapping election results."

    Monday, March 3, 2008

    Illustrating quantity

    A bar chart is empty of emotion, with the number 3 delivering the same impact as 300 million. That isn't true of this series of images by Chris Jordan called "Running the Numbers: An American Self-Portrait":

    This series looks at contemporary American culture through the austere lens of statistics. Each image portrays a specific quantity of something: fifteen million sheets of office paper (five minutes of paper use); 106,000 aluminum cans (thirty seconds of can consumption) and so on. My hope is that images representing these quantities might have a different effect than the raw numbers alone, such as we find daily in articles and books. Statistics can feel abstract and anesthetizing, making it difficult to connect with and make meaning of 3.6 million SUV sales in one year, for example, or 2.3 million Americans in prison, or 410,000 paper cups used every fifteen minutes. This project visually examines these vast and bizarre measures of our society, in large intricately detailed prints assembled from thousands of smaller photographs. The underlying desire is to emphasize the role of the individual in a society that is increasingly enormous, incomprehensible, and overwhelming.

    Saturday, March 1, 2008

    Numbrary

    The goal of the Numbrary is to be an online library for numbers. From its about page:

    It's hard to locate good numbers about anything.

    Here, try it yourself: How much money have the top 3 pharmaceutical firms spent on research & development over the last 5 years?

    Bet that took you a while. But these numbers are all in the public record — why should it take more than a few seconds to answer the question?

    You can contribute data to the site, suggest data for it to acquire and track updates via its blog.