In a 2008 article, John Herbert and Karen Estlund wrote that newspaper digitization was “exploding.” 1 As leaders of the Utah Digital Newspapers project, they’re probably the right people to ask. The project was a poster child for digitization and open access for years, snapping up grant funding and positive press. Even if it was sometimes overshadowed by Google’s newspaper project, it’s still a strong example of a digitization effort that makes an otherwise buried resource available to the world. Herbert and Estlund’s paper is heavy on technical insights, like a comparison between digitizing from paper and microfilm. 2 But there’s a hint, near the end, that might have been the most prophetic piece of the article: until OCR technology gets closer to 100% accuracy, the project depends on two human reviewers typing up every masthead, on every page, in every newspaper. 3
As an article from the Cambridge Digital Humanities Network puts it, there is no small number of “tasks which computers cannot yet do effectively,” and crowdsourcing has become a popular way to invite the public into a project and fill in those gaps in the digital record. Examples like the protein-research game FoldIt or the many initiatives on Zooniverse connect science-minded internet users to research projects, and digitization projects like the Washington State Digital Archives allow online guests to help transcribe handwritten records. Duolingo is another successful example, offering both a language-learning tool and a way to crowdsource translations. The mob, in B.F. Skinner’s words, “rushes in where individuals fear to tread.” 4
There’s a joke about the wisdom of crowds in here, somewhere. As Larry Cebula points out in a blog post on the Library of Congress project to publish its photo collection on Flickr, a global network of bored strangers isn’t always an ideal partner. For every helpful contribution in the Flickr pool, there are a dozen dumb jokes to bury it. “As a way to add useful metadata to historical documents,” Cebula concludes, “the Library of Congress Flickr Pilot Project is a disappointment.” Smaller projects have seen more success, but high-profile examples should deflate the fantasy that crowdsourcing will magically improve online collections all by itself.
It’s hard to talk about these things without talking about the commoditization and de-skilling of crowdsourced labor. The public may have a lot to offer research projects, but the valuation seems one-sided: at the reductio ad absurdum of the crowdsourced web, human expertise is just one more pool of ‘content’ to be repackaged and resold. It’s wonderful when “a small core of highly committed volunteers” 5 has structures in which to collaborate; it’s a vital piece of the internet’s best potential. But as Google’s newspaper project has amply demonstrated, public and private interests are difficult to balance in digital projects.
From my own experience with Scribe, the Washington State Digital Archives’ crowdsourcing tool, “assistant to the OCR system” is a strange volunteer job. A great many of the records in the state archives are handwritten documents that software doesn’t (presently) decode very well. Scribe lets volunteers read the records and transcribe them into the appropriate database fields. But you don’t have to go through too many records before you discover that handwriting isn’t the only issue with these documents: “required” information goes missing, the actual fields on the form don’t always match the data-entry fields, and notations in the margins of forms can complicate the records’ meanings. These are all places where human attention to the records could improve the data, but they all fall outside the very narrow task that Scribe puts in front of you. The “crowd” is just one cog in this machine.
There are other models of collaboration on (and off) the internet, and I think we should hold crowdsourcing to a high standard. Open-source software could be one comparison, and wikis are another. These set a higher bar for entry than Flickr does, but they also create a space for creativity and expertise to be expressed in more constructive ways. And generally, they hold to stricter principles of who “owns” volunteers’ contributions, too. Crowdsourcing is an exciting idea. It’s important enough that we ought to get it right, and there’s clearly a long way to go.
1. Herbert and Estlund, Creating Citizen Historians, 333
2. Herbert and Estlund, 338
3. Herbert and Estlund, 340
4. B.F. Skinner, Walden Two
5. Digital Humanities Network, Crowdsourcing