…That’s a message I tweeted back in April 2010, when it was announced that
the Library of Congress was planning to archive
every publicly available tweet ever posted on the social network. Now,
almost three years later, the Library’s Twitter archive is beginning to take
shape, and there are clues as to what uses researchers will find for all of
our 140-character witticisms.
When the Library initially took on the Twitter archive in 2010, it was
already a daunting 21 billion tweets filled with words, hashtags, geolocation
info, and other metadata. Today the Library has access to more than 170 billion
tweets, or about 85 terabytes of data. With about half a billion tweets now
flowing into the archive daily, the biggest immediate challenge is finding a way
to make all this information coherent and usable.
“One of the things that makes this collection a little bit different is the
velocity with which it’s growing,” says Gayle Osterberg, director of
communications for the Library of Congress. “The computing capacity to search
for an item or a series of items across billions and billions of tweets isn’t
cost-effective at the present time for a public institution.”
Osterberg says the costs associated with the project, in terms of developing
the infrastructure to house the tweets, are in the low tens of thousands of
dollars. The tweets were offered as a free gift from Twitter, and are being
transferred to the Library through a separate company, Gnip, at no cost. Each
day tweets are automatically pulled in from Gnip, organized chronologically and
scanned to ensure they’re not corrupted. Then the data are stored on two
separate tapes which are housed in different parts of the Library for security
reasons.
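The Library hasn't published its tooling, but the daily routine described above (pull the batch, order it chronologically, scan for corruption, keep duplicate copies) can be sketched roughly. Every name and record format below is hypothetical:

```python
import hashlib
import json

def ingest_batch(raw_lines):
    """Parse one day's tweet records, reject corrupt lines, sort them
    chronologically, and return a checksum that the two archived
    copies can later be verified against."""
    records = []
    for line in raw_lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            # A malformed record means the batch failed the corruption scan.
            raise ValueError("corrupt record in batch") from err
    # Organize chronologically by the tweet's creation timestamp
    # (ISO 8601 strings sort correctly as text).
    records.sort(key=lambda r: r["created_at"])
    blob = json.dumps(records, sort_keys=True).encode("utf-8")
    return records, hashlib.sha256(blob).hexdigest()
```

Storing the same checksum alongside each tape copy lets either copy be validated independently, which is the point of housing them in separate parts of the building.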
The Library has mostly figured out how to organize the archive, but
usability remains a challenge. A simple query of just the 2006-2010 tweets
currently takes about 24 hours. Increasing search speeds to a reasonable level
would require purchasing hundreds of servers, which the Library says is
financially unfeasible right now. There’s no timetable for when the tweets might
become accessible to researchers.
“The goal would be to be able to answer whatever query a researcher might
have here at the Library in our reading room,” Osterberg says. “The balance is
making the access both meaningful and cost-effective for the Library.”
While you can't yet make a trip to Washington, D.C., and casually peruse
all the world's tweets, the technology to do exactly that is readily
available, for a price. Gnip, the organization feeding the tweets to the Library,
is a social media data company that has exclusive access to the Twitter
“firehose,” the never-ending, comprehensive stream of all of our tweets.
Companies such as IBM pay for Gnip’s services, which also include access to
posts from other social networks like Facebook and Tumblr. The company
also works with academics and public policy experts, the type of people likely
to make use of a free, government-sponsored Twitter archive when it comes to
fruition.
Through Gnip, researchers have already made extensive use of much of the
Twitter archive. Sherry Emery, a senior scientist at the Institute for Health
Research and Policy at the University of Illinois at Chicago, analyzes tweets
about smoking to understand the
role of media in influencing the habit. When the Centers for Disease Control and Prevention (CDC)
launched a graphic anti-smoking ad campaign
last spring, Emery and her team were able to analyze every public tweet
about smoking and understand how people were reacting to the
commercials.
“We can’t use Twitter to look at whether they actually quit smoking,” Emery
says. “But we can really get a better understanding of whether people embraced
the message of the ad or not, which is an important intermediate step to making
changes in their behavior.”
Even with the tools available to quickly search the Twitter archive, making
sense of such a huge dataset can be a challenge. Emery’s group has amassed more
than 50 million tweets about smoking since December 2011. “A big part of what we
do is just cleaning the data to make sure that the tweets that we’re looking at
are about smoking tobacco and not about smoking weed or smoking ribs or smoking
hot girls,” she says.
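Emery's group hasn't published its filtering rules, but that cleaning step could be approximated with keyword patterns like these. The term lists here are invented for illustration:

```python
import re

# Hypothetical exclusion filter: keep tweets plausibly about tobacco,
# drop common false positives like "smoking weed" or "smoking ribs".
EXCLUDE = re.compile(r"\bsmoking\s+(weed|pot|ribs|brisket|hot)\b", re.IGNORECASE)
INCLUDE = re.compile(r"\b(cigarette|tobacco|smoking)\b", re.IGNORECASE)

def is_tobacco_tweet(text):
    """True if the tweet mentions smoking terms and trips no exclusion."""
    return bool(INCLUDE.search(text)) and not EXCLUDE.search(text)
```

In practice the exclusion list would be far longer and tuned against hand-checked samples, since a two-pattern filter like this still lets plenty of noise through.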
Using computer software to assess human emotion in tweets is also tricky. For
example, someone tweeting “This is scary!” about the CDC commercial featuring
former smokers with artificial voice boxes might seem like a negative reaction
to a computer program. In fact, it’s the desired effect for an ad aimed at
curbing smoking. Human coders have to feed the computer between 500 and 1,000
sample tweets to help it properly understand how to organize responses to a
research question.
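The article doesn't name the classification software, but the workflow it describes is standard supervised text classification: human coders label a sample of tweets, and the program generalizes from those examples. A toy Naive Bayes sketch, with training examples invented for illustration:

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Toy word-count Naive Bayes, standing in for the hand-labeled
    training step the article describes (500 to 1,000 coded tweets)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = {lbl: Counter() for lbl in self.label_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        best, best_lp = None, float("-inf")
        total = sum(self.label_counts.values())
        for label, n in self.label_counts.items():
            lp = math.log(n / total)  # class prior
            wc = self.word_counts[label]
            denom = sum(wc.values()) + len(self.vocab)  # Laplace smoothing
            for w in text.lower().split():
                lp += math.log((wc[w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Trained on coder-labeled examples, a model like this can learn that "scary" signals the ad's intended effect rather than a negative reaction, which is exactly the distinction raw sentiment software misses.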
Other uses for tweets have also emerged. Daniel Hodd, a business
student at Fordham University, is studying the conversation surrounding 50
stocks on Twitter to see if investor sentiment correlates with stock price.
Chris Cantey, a master’s student studying cartography at the University of
Wisconsin, has already used the more limited Twitter API system to
geographically map last month’s flu
outbreak. He’s now using the full firehose to analyze how responses to Hurricane Sandy unfolded in
real time.
All the researchers agree that Twitter is a powerful tool for
sociological study. Soon, if the Library of Congress can make its database fully
functional, it’ll also be an easily accessible one. And one day, long after
we’ve all sent our final snarky tweet, our messages will live on.
“Social media gives even those among us who don’t have the time to pick up a
pen every day an opportunity to be recorders of and witnesses to history,” says
Osterberg. “Those perspectives will be incredibly valuable to researchers and
authors and policy makers down the road who want to understand the times we’re
living in today.”
Source: business.time.com