Meformers or Mesearchers? On Finding Self-Centeredness in Social Media

The Miami Herald, this week, resurrected the preliminary report on a study of Twitter produced by some communications researchers at Rutgers back in September: "Is it Really About Me? Message Content in Social Awareness Streams." In short, their answer was yes: for 80% of Twitter users, it's really all about "me." And, with a clever bit of coinage, they dubbed these individual vessels of self-absorption "Meformers." In many ways, this finding is hardly surprising. It is a veritable old chestnut that social media consist mostly of the self-centered and self-interested chatter of a new generation of narcissists. But when a bit of research so profoundly confirms a popularly held ideology, it's almost always worth taking a closer look. I'm not prepared to dispute that stereotype, but this sloppy bit of research hardly confirms it. Let's look at where it breaks down, both theoretically and methodologically: (*)

First a brief recap:

The researchers describe Twitter as a subset of "Social Awareness Streams," characterized by the following three traits:

1. They are public
2. They are brief
3. They appear in a "highly connected social space, where most of the information consumption is enabled and driven by articulated online contact networks"

The analysis relies on a prior content analysis (Java et al.) that hypothesized three "distinct user activities" (emphasis mine):

1. Information Seeking
2. Information Sharing
3. Social Activity

It pains me to do so, but let's ignore for the moment the problem of identifying "being social" and seeking or sharing information as distinct activities, other than to say that, under any close examination, the distinction breaks down beyond utility. There is a grand (and misguided) tradition in the study of language of assuming uni-functionality in communicative practices, so these researchers are in good company. Nevertheless, there are five major problems with the research that fatally undermine the researchers' conclusions:

1. The data are not an accurate representation of the uses to which individuals put Twitter. Vast swaths of users and uses are excluded from consideration by design.

The researchers downloaded about 125 thousand user IDs from Twitter's public timeline (which, I understand, is itself a random sampling of Twitter users). Then they made some deep cuts: they kept only "active participants" (those who had at least 10 friends, 10 followers, and at least 10 posted messages) and excluded "organizations, marketers or those who 'have something to sell.'" An unspecified sub-sample of the 125k users was then examined individually to identify 'personal use.' It is unclear from this preliminary report how many users survived this intermediate sampling, but 911 qualifying users were identified, from which 350 were randomly selected. The entire public, API-limited tweet stream of each of these users was downloaded, and ten messages were 'randomly' chosen for each user, but only non-reply messages were included (emphasis mine). According to the example for the category type Anecdote (others), 'mentions' were included. Since the sample tweets were plucked randomly from context, it appears the researchers used a simple syntactic sort to exclude replies but include mentions. In other words, if the tweet started with an @, it was treated as a reply; an @username anywhere else was treated as a mention. (I've written a little about the syntax of replies and mentions here, if you're curious.) They indicate that 13 users had fewer than 10 qualifying tweets in the downloaded corpus. The result was 3379 qualifying messages. (I contacted the researchers about the odd discrepancy here. Under ideal conditions, you'd have 3500 qualifying tweets (350 users x 10 tweets each), but the corpus falls short by 121. Even if each of those 13 users had contributed only a single qualifying tweet, the shortfall would be at most 117 tweets (13 users x 9 missing tweets each), still four short of 121. In response to a query, the researchers indicated that there were other reasons individual tweets were excluded, including 'language.' We can only hope that the final report spells out the details of the data set in more detail.)
So far, then, in an analysis of how individuals use Twitter, streams that appear to be selling something, as well as organizations, are excluded in principle, as are individual tweets that a syntactic test judges to be 'replies.' It is certainly odd to set out to describe how people use Twitter while excluding, in principle, so much of the information seeking and sharing as well as social activity.
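For concreteness, the syntactic sort described above can be sketched in a few lines of Python. This is my reconstruction of the filter I've inferred from the report, not the researchers' actual code; the function name and example tweets are my own:

```python
# A minimal sketch of the inferred syntactic sort: a tweet counts as a
# "reply" only if it begins with @username; an @username anywhere else
# is treated as a "mention" and kept in the sample.

def classify(tweet: str) -> str:
    """Classify a tweet as 'reply', 'mention', or 'plain'."""
    text = tweet.strip()
    if text.startswith("@"):
        return "reply"    # excluded from the study's sample
    if "@" in text:
        return "mention"  # retained, e.g. under Anecdote (others)
    return "plain"        # retained

examples = [
    "@jack thanks for the follow!",        # reply -> excluded
    "Riding bikes with @jack before 7am",  # mention -> included
    "tired and upset",                     # plain -> included
]
for t in examples:
    print(classify(t), "|", t)
```

Note how crude the test is: a conversationally embedded reply that happens not to start with @ survives the filter, while an @-initial aside addressed to no one in particular is thrown out.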

2. The range of content categories--they identify nine distinct categories--is unmotivated other than by the consensus of the coders/researchers.

[Table 1: Meformer Message Categories]

Why are there nine categories? No reason in particular. In an initial sort, the researchers came up with seven categories, and when they sent the data to coders, they got some feedback that led them to expand it to nine. But on examination, these categories are a bit hard to describe as bounded, especially absent any context. And since eight of the nine categories appear to contribute to the 80% "meformer" number, the methodology seems skewed towards that result.

3. The data evidence a high degree of ambiguity, even so.

Tweets were each scored by two coders in order to sort them into these nine categories. According to the authors, "[o]ver-coding was not a problem as messages had 1.3 categories assigned on average." Really? Some quick math: 3379 messages x 1.3 categories per message = 4392.7 category assignments. That means roughly 1013 messages--nearly one in three--were assigned to more than one category, presumably because the two coders disagreed, because the messages contained multiple kinds of content, or because they were otherwise ambiguous. That's a bit of a problem for your sorting system!
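The back-of-the-envelope arithmetic is easy to check. The snippet below just reproduces the figures quoted above; the "nearly one in three" reading assumes each multi-coded message received exactly two categories:

```python
# Quick check of the over-coding arithmetic reported in the study.
messages = 3379
avg_categories = 1.3  # reported average categories per message

total_assignments = messages * avg_categories      # ~4392.7
extra_assignments = total_assignments - messages   # ~1013.7
share_multi = extra_assignments / messages         # ~0.30

print(round(total_assignments, 1))  # total category assignments
print(round(extra_assignments, 1))  # assignments beyond one per message
print(round(share_multi, 2))        # nearly one in three messages
```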

4. The examples presented, presumably, as "best cases" themselves fail.

Let's look at the examples provided in Table 1 above. Under "Me now," one example is "tired and upset." There is no indexical marker by which to infer the subject of this tweet. Stripped of context, it could belong to almost any of the categories listed above, including Information Sharing! Absent a grammatical subject, the only way to resolve the content of this tweet is by appeal to a stereotype of the tweeter, namely that he or she is self-absorbed by default. How about Statements and Random Thoughts? (Again, let's politely ignore that ALL of the examples above, and indeed all tweets, qualify as "statements.") One example: "The sky is blue in the winter here." By any conventional understanding, that seems to be Information Sharing. Sure, it may be relatively trivial or widely known information, but it is formulated as 'information.' An example of Anecdote (others)? "Most surprised dragging himself up pre 7am to ride his bike!" By the convention established in the Me now (ME) category, the person "most surprised" here is, by default, the tweeter, so the anecdote is really about his or her surprise, not the other user's unusual fitness effort. Each of the examples given is pretty difficult to defend as occupying any single message category as described above. The coders are to be forgiven!

5. The narrow focus on individually coded tweets, while methodologically simpler, ignores the 'social' part of the medium (especially through the exclusion of @replies), leading to an inherent bias towards individual-report-type tweets.

Why would you seek to exclude "replies" in a study of how people use Twitter? The answer becomes clear when we look at how the coding procedure was conducted so that it produced discrete content categories. In short, excluding replies makes it easier to code individual tweets, but the effect is to privilege just the kind of data whose prevalence you are actually trying to measure. When you eliminate the context in which tweets are embedded, sorting those tweets becomes easier only if you appeal to conventional stereotypes in evaluating them. But to stipulate self-centeredness is not to discover it. If you set out to find self-centered individuals in the world of social media, you are likely to be successful!

In sum, this preliminary report's reinforcement of the ideology of self-centeredness in the content of social media is a precipitate not of careful analysis of the data in social awareness streams but of the stipulations of a methodology and an idiosyncratic system of categorization. The selfishness of social media is an ideology stipulated by the method here, not discovered in the data. Self-centeredness in social media, it appears, is less a product of 'meformers' than of 'mesearchers.'

*(Note: I contacted the researchers personally to try to get a closer look at the data corpus, but because they are still at work on it, they were reluctant to share it. It should be noted, further, that this is a preliminary report, and the final report will reportedly spell out some of the ambiguities in the methods.)


"Is it Really About Me?  Message Content in Social Awareness Streams"
Mor Naaman,  Jeffrey Boase, Chih-Hui Lai
Rutgers University, School of Communication and Information