Bird E-mail Data Mining MSR 2006
Summary
The authors looked at the Apache developers mailing list archive and CVS repository commit logs, considering messages covering a period of four or five years. They did so with the goal of studying communication and collaboration technologies (C&C) in software projects, particularly in open source software development. They are specifically interested in how activities in C&C correspond to development activities in the source code: what are the social properties of the developer network; do active communicators also make a lot of source code changes; do developers and non-developers play different social roles; and do the most active developers have the highest status among developers. They examined an OSS project because most or all of its communications are purposely made publicly available.
The authors looked at each participant in the mailing list, and divided the group into developers (those who contributed code or documentation changes to the CVS repository) and non-developers (those who didn't). For each participant, they looked at how many messages the person sent, how many of their messages were replied to, and three social networking measures: in-degree (the number of edges connecting to a node in a directed graph; in this case, the number of different people to whom a person has replied), out-degree (the number of edges emerging from a node in a directed graph; in this case, the number of individuals who have replied to a person) and betweenness (the number of shortest paths that go through a node; high betweenness indicates that a person acts as a gatekeeper or broker, playing a role in many interactions). They also presented a directed sociogram of the Apache mailing list archive in which the arrows indicated who responded to whom more often (but didn't do much with it).
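To make the three measures concrete, here is a minimal sketch in plain Python. It is not the authors' code. It assumes the reply data is already available as (replier, original sender) pairs, follows the in-degree/out-degree directionality exactly as summarized in these notes, and orients edges from the original sender to the replier (an assumption). The function names are hypothetical, and betweenness is the raw, unnormalized count computed with Brandes' algorithm.

```python
from collections import defaultdict, deque


def degree_measures(replies):
    """Return {person: (in_degree, out_degree)} from (replier, sender) pairs.

    Directionality as summarized in these notes:
    in-degree  = number of distinct people this person has replied to,
    out-degree = number of distinct people who have replied to this person.
    """
    replied_to = defaultdict(set)
    repliers = defaultdict(set)
    for replier, sender in replies:
        if replier != sender:  # ignore self-replies
            replied_to[replier].add(sender)
            repliers[sender].add(replier)
    people = set(replied_to) | set(repliers)
    return {p: (len(replied_to.get(p, ())), len(repliers.get(p, ())))
            for p in people}


def reply_graph(replies):
    """Adjacency sets with an edge sender -> replier (assumed orientation)."""
    adj = defaultdict(set)
    for replier, sender in replies:
        if replier != sender:
            adj[sender].add(replier)
    return dict(adj)


def betweenness(adj):
    """Unnormalized betweenness (Brandes) for an unweighted directed graph."""
    nodes = set(adj) | {w for succ in adj.values() for w in succ}
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(nodes, 0.0)
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical toy data: bob replied to alice, carol replied to bob.
replies = [("bob", "alice"), ("carol", "bob")]
print(degree_measures(replies))
print(betweenness(reply_graph(replies)))
```

In the toy data, bob is the only person sitting on a shortest path between two others, which is the "broker" role that high betweenness is meant to capture.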
They found that messages sent, messages replied to, in-degree, and out-degree follow a Pareto distribution (a power law probability distribution; a few people send a lot, but most people send a little), the latter showing a "long tailed degree distribution, characteristic of small world networks" (p. 141). There was a strong relationship between the number of messages sent by someone and the number of distinct people that respond to them (p. 141). They found a high correlation (Spearman rank correlation of 0.80) between messages sent and number of source changes made, indicating that C&C activity is correlated with development work (p. 141). There was a lower correlation between messages sent and document changes.
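For reference, a Spearman rank correlation like the one reported above can be computed as the Pearson correlation of the ranks. The sketch below is a generic plain-Python version, not the authors' analysis code; the per-person counts in the usage example are invented purely for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented counts for five hypothetical people (not data from the paper).
messages_sent = [120, 45, 300, 8, 60]
source_changes = [40, 10, 95, 0, 30]
print(spearman(messages_sent, source_changes))
```

Because it works on ranks, the measure is insensitive to the heavy skew of the underlying counts, which is presumably why it suits Pareto-distributed activity data.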
Developers do act as brokers or gatekeepers more than non-developers (p. 142), and generally have higher status (computed as what?), and developers who do more source code changes play more significant roles in the mailing list. Higher activity in source code changes is strongly correlated with higher activity in the mailing list; document changes are less strongly correlated. Generally, high in-degree, out-degree, and betweenness are correlated with status (how?) and source code change activity.
Issues
Data extraction
They used the In-Reply-To: header of each reply, which carries the Message-ID: of the message being replied to (if any), to determine who replied to whom, and suggest that you could also look through the contents for quoted text attributions. The sender of a reply is "one who found the initial message of interest" (p. 139).
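A minimal sketch of this kind of extraction, using only the Python standard library. It assumes replies identify their parent through the standard In-Reply-To: header, which carries the parent's Message-ID:, and that raw messages are available as strings. The function name and sample messages are hypothetical, and quoted-text attribution is not attempted.

```python
from email import message_from_string
from email.utils import parseaddr


def reply_edges(raw_messages):
    """Return (replier, original_sender) address pairs for a list of raw e-mails."""
    parsed = [message_from_string(raw) for raw in raw_messages]

    # First pass: remember who sent each Message-ID.
    sender_by_id = {}
    for msg in parsed:
        msg_id = (msg.get("Message-ID") or "").strip()
        if msg_id:
            sender_by_id[msg_id] = parseaddr(msg.get("From", ""))[1].lower()

    # Second pass: link each reply to the sender of its parent, if we have it.
    edges = []
    for msg in parsed:
        parent_id = (msg.get("In-Reply-To") or "").strip()
        if parent_id in sender_by_id:
            replier = parseaddr(msg.get("From", ""))[1].lower()
            edges.append((replier, sender_by_id[parent_id]))
    return edges


# Hypothetical two-message thread.
raw = [
    "Message-ID: <1@example.org>\nFrom: Alice <alice@example.org>\n"
    "Subject: patch\n\nPlease review.",
    "Message-ID: <2@example.org>\nIn-Reply-To: <1@example.org>\n"
    "From: Bob <bob@example.org>\nSubject: Re: patch\n\nLooks fine.",
]
print(reply_edges(raw))
```

Replies whose parent is missing from the archive are silently dropped here; a real pipeline would need to decide how to treat them.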
One of the few groups to deal explicitly with e-mail alias unmasking: many people have more than one e-mail address, and ensuring that we count all the e-mail from those different addresses as belonging to that person is not trivial. They used a clustering algorithm plus manual inspection to develop a lookup table of e-mail addresses to names. The similarity measure they used for the clustering is based on the fields in the From: line:
From: Chris Malek <cmalek@caltech.edu>
They compared normalized names to names and e-mails to e-mails using the Levenshtein distance, compared names to e-mails, and took the maximum of the three scores (p. 139). They did this for all pairs of <name, e-mail> tuples. They used a similar method for unmasking CVS aliases.
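The pairwise scoring described above can be sketched roughly as follows. This is a simplified illustration, not the authors' algorithm: the normalization rules, the conversion of edit distance into a 0..1 similarity, the use of only the local part of the address, and the name-to-e-mail comparison are all assumptions made here, and the function names are hypothetical.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Map edit distance to a 0..1 score (assumed scaling, 1 = identical)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def normalize_name(name):
    """Lowercase, drop punctuation and digits, collapse whitespace (assumed rules)."""
    return " ".join(re.sub(r"[^a-z ]", " ", name.lower()).split())


def alias_score(p, q):
    """Maximum of name-name, e-mail-e-mail, and name-e-mail similarity."""
    (name1, email1), (name2, email2) = p, q
    n1, n2 = normalize_name(name1), normalize_name(name2)
    u1 = email1.lower().split("@")[0]  # local part only (an assumption)
    u2 = email2.lower().split("@")[0]
    name_name = similarity(n1, n2)
    email_email = similarity(u1, u2)
    name_email = max(similarity(n1.replace(" ", ""), u2),
                     similarity(n2.replace(" ", ""), u1))
    return max(name_name, email_email, name_email)


# Second tuple is a made-up alias for the address shown in the notes.
a = ("Chris Malek", "cmalek@caltech.edu")
b = ("C. Malek", "chris.malek@example.org")
print(alias_score(a, b))
```

Scores above some threshold would then feed the clustering step, with manual inspection to catch false merges.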
Social networking measures
They comment on connectedness, but don't use it except to say that the most highly connected people in the Apache network are, in fact, the most productive developers (p. 140), and that they are doing further research into that.
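The "small world network" is a statement about the mean shortest path and clustering of the network: short average paths combined with high clustering. A power-law degree distribution is the defining property of scale-free networks, not of small-world networks, and a network can be one without being the other. A degree distribution that falls off exponentially would indicate that a network is not scale-free.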
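The two quantities behind the small-world statement, mean shortest path length and clustering, can be computed as in the sketch below. It is a generic illustration rather than anything from the paper, and it assumes an undirected graph given as symmetric adjacency sets in which every node appears as a key.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj):
    """Average local clustering: fraction of a node's neighbor pairs that are linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def mean_shortest_path(adj):
    """Mean hop count over all ordered pairs of distinct, mutually reachable nodes."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical four-person network: a triangle plus one pendant node.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(clustering_coefficient(adj), mean_shortest_path(adj))
```

A network is usually called small-world when its clustering is much higher than that of a random graph of the same size and density while its mean path length stays comparably short.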
They used messages sent and out-degree to make the statement about number of messages sent vs. number of unique repliers. They're doing further investigation into this.
They used betweenness with in-degree and out-degree to show that developers do act as brokers more than non-developers (p. 142), and generally have higher status (computed as what?), and that developers who do more source code changes play more significant roles in the mailing list.
Critique
I'm interested in this paper primarily for their data extraction and analysis techniques. What is also interesting is how they link e-mail activity with actual work product activity. This could be interesting in talking about power networks, since it shows actual action in addition to communication and coordination.
The authors compute "status" in some way from in-degree, out-degree, and betweenness, but they don't explain why. It seems to me that they mixed up in-degree and out-degree directionality.

