Tuesday, January 08, 2013
In the few minutes it will take you to read this story, some 3 million new tweets will have flitted across the publishing platform Twitter and ricocheted across the Internet. The Library of Congress is busy archiving the sprawling and frenetic Twitter canon — with some key exceptions — dating back to the site’s 2006 launch. That means saving for posterity more than 170 billion tweets and counting, with an average of more than 400 million new tweets sent each day, according to Twitter.
But in the two years since the library announced this unprecedented acquisition project, few details have emerged about how its unwieldy corpus of 140-character bursts will be made available to the public.
That’s because the library hasn’t figured it out yet.
“People expect fully indexed — if not online searchable — databases, and that’s very difficult to apply to massive digital databases in real time,” said Deputy Librarian of Congress Robert Dizard Jr. “The technology for archival access has to catch up with the technology that has allowed for content creation and distribution on a massive scale. Twitter is focused on creating and distributing content; that’s the model. Our focus is on collecting that data, archiving it, stabilizing it and providing access; a very different model.”
Colorado-based data company Gnip is managing the transfer of tweets to the archive, which is populated by a fully automated system that processes tweets from across the globe. Each archived tweet comes with more than 50 fields of metadata (where the tweet originated, how many times it was retweeted, who follows the account that posted the tweet and so on), although content from links, photos and videos attached to tweets is not included. For security’s sake, there are two copies of the complete collection.
But the library has not begun, in any meaningful way, the daunting task of sorting or filtering its 133 terabytes of Twitter data, which it receives from Gnip in chronological bundles.
“It’s pretty raw,” Dizard said. “You often hear a reference to Twitter as a fire hose, that constant stream of tweets going around the world. What we have here is a large and growing lake. What we need is the technology that allows us to both understand and make useful that lake of information.”
For now, giving researchers access to the archive remains cost-prohibitive for the cash-strapped library, which has spent tens of thousands of dollars on the project so far, Dizard says. Like many federal agencies, the Library of Congress has been hit by budget cuts in recent years. Without a major overhaul to its computing infrastructure, it isn’t equipped to handle even the simplest queries.
“We know from the testing we’ve done with even small parts of the data that we are not going to be able to, on our own, provide really useful access at a cost that is reasonable for us,” Dizard said. “For even just the 2006 to 2010 [portion of the] archive, which is about 21 billion tweets, just to do one search could take 24 hours using our existing servers.”
Instead, the library is exploring whether it might be able to afford to pay a third party to provide public access to the archive. But for those who have immediate research interests — and many people have contacted the library, Dizard says — the wait is maddening.
Gnip President Chris Moody says he’s used to serving clients like major corporations and political campaigns that expect data right away.
“Milliseconds is not uncommon for expected latency from when the tweet happened to when someone would be able to get it and analyze it,” he said.
Even after questions of access are resolved, Moody says he expects centuries to pass before the full value of the Twitter archive can be realized.
“We’re very, very early,” Moody said. “We’re 1 percent of the way into what this data will mean.”
The eventual plan is to make the collection available only within the Library of Congress reading rooms. Requiring an in-person visit to search a database of material that originated online may seem incongruous, but Dizard says it’s a condition of the deal with Twitter, which gifted the archive, so that the library won’t be “competing with the commercial sector.”
There are other limitations. The library is not archiving tweets from those who opt for the strictest privacy settings, which allow Twitter users to approve or reject each potential follower. The library is also planning to scrub deleted tweets, meaning the public won’t have access to posts that were published but later removed. Dizard, citing privacy concerns, calls that decision “one of the more significant policy questions we face.”
In its terms of service, Twitter says that the default is “almost always to make the information you provide public for as long as you do not delete it from Twitter.”
Moody says it follows that deleted tweets are off-limits.
“Twitter’s terms of service are quite clear,” Moody said. “Any organization that accesses Twitter data through us, and this certainly applies to the Library of Congress as well, has to comply with the terms of service.”
The tension lies in the historical value of seeing what a person publishes, then erases. The sexually suggestive tweet that led to Rep. Anthony Weiner’s resignation is one of the splashiest examples of how deleted tweets can be significant, but even seemingly mundane deletions could carry weight with the passage of time.
The nonprofit Sunlight Foundation runs a site called Politwoops, which gathers politicians’ deleted tweets into a searchable collection. Tom Lee, the director of the foundation’s Sunlight Labs, says he finds it bizarre that the library’s archive would exclude such tweets.
“You can’t make a TV appearance or press release or speech just disappear,” Lee said. “It’s not clear to me why someone should be allowed to remove something from the public record.”
A Twitter spokesman wouldn’t say whether the site has considered retroactively making deleted tweets public, the way some government files are declassified after a certain number of years. But there are people within the Library of Congress who argue that deleted tweets ought to be part of the archive, Dizard says. The debate is unlike anything the library has had to consider in its 212-year existence.
“You could look at it strictly and say that anybody who puts a tweet up has published it,” Dizard said. “We have never received a collection that has ownership transferred through a click-through agreement, so that’s the difference. Most of our collections come with signed agreements or purchase. This is a different way of acquiring.”
The Twitter archive also signals a shift in how the library sees what kinds of acquisitions are possible. The library will continue amassing physical objects such as personal papers, books, maps and copyright registrations. But Dizard says the Library of Congress is also exploring how to acquire other, more ephemeral trappings of the digital realm, such as Google-search histories.
“Search-engine requests are a very, very good indication of what people are thinking,” Dizard said. “The acquisition of the Twitter archive better prepares us and encourages us to take other aspects of social media and digital content. That’s something we’ll have to do. . . . The acquisition of the Twitter archive is a start for us, not a test about whether we want to continue or not. This is really a critical part of the mission of the library.”