Nov 22 2013
Do we ever use all the data that’s out there (and that we generate every day) for the benefit of our communities? How could the data that we generate help our communities see themselves in new or more useful ways? The burgeoning new industry around Big Data is built on the premise that collection is ubiquitous, the costs of access and processing are falling rapidly, and our knowledge about creating useful visualizations is growing. For the last year I’ve been mulling whether these assumptions apply to the communities in which I participate (as a member or as a consultant).
There are two characteristics of KM4Dev, where I’m both a member and (for a brief period of time) a paid consultant, that make the community and its potential use of data significant. The first is that the community lives on multiple platforms that are completely independent of each other, so there is always some uncertainty about community boundaries and about who is included in any given data frame. Many of the questions we might ask would require combining data from several of those platforms, which can take a lot of effort or turn out to be impossible for various reasons. All of that makes using data for a community more challenging. Communities that use multiple platforms, like KM4Dev, really are the norm; one-platform communities are the exceptional (but well-studied) case. The second characteristic of KM4Dev that is important here is that it is independent and self-funding. Often it is sponsors and advertisers that drive the collection and interpretation of data about communities. Since I am both a member and am funded to look at KM4Dev at the moment, it seems important to dig into the issues and explain myself. I’ve put together several blog posts from the current project I’m doing for KM4Dev.
I’ve pulled examples from my phase 1 report and added some discussion and details that don’t really belong in the report. I’m trying to do several things in this series of posts:
- Unpack some of the data access issues
- Illustrate some of the steps involved in producing a graph or table output
- Say what I see in the output from a community development perspective
- Describe what next steps could or should be and why I stopped where I did
The first step, of course, is recognizing that there might be data that’s relevant and helpful. Community development folks like me tend to be a very intuitive lot that’s oriented toward person-to-person interaction, so it’s a bit of a leap to reach for “hard data.” As I’ll show in the case of KM4Dev, figuring out where the useful data might be and how to get access can be a bit of a project. One obstacle is that data access and analysis are quite professionalized, involving specialized tools and invoking high standards. Some day I can show you the scars from tongue lashings from statisticians telling me about my crimes against good practice: painful and discouraging at the time. I’m going to argue that rough, quick and dirty is pretty easy and good enough.
Sometimes there is no obstacle to getting access to the data. You have access because you are a member and it’s already summarized for you. LinkedIn, for example, has ready-made statistics for its groups:
It has to be said that LinkedIn isn’t where most KM4Dev interactions occur, so we have to consider whether the bar chart in Figure 7 represents the general KM4Dev population.
The answer is: probably not, but the graphic gets us thinking about the organizational affiliation of KM4Dev community members. When I think about it, people who participate in KM4Dev conversations are fairly senior and quite accomplished. All I had to do to produce Figure 7 was capture a screen-print (with Snagit) and paste it in my report. Access was easy but interpretation not so much. Still, the point is that no data or graphic is going to be the last word: suggestions, pointers and indications are all we should expect. If a graphic produces a good question, it has done its job.
Things aren’t always that easy, but hopefully our curiosity is stimulated and we keep looking. In this case, it turns out that the KM4Dev community’s front door (http://km4dev.org) is a Ning site that has been set up to collect interesting data from people when they register. Several years ago I offered to help with a chore that required admin access. I had forgotten about it, but it turned out that I still had admin access to the site, so I could just download a CSV file that could be easily opened in Excel or manipulated by other programs. I have found that having to explain to someone what you are going to do with the data before you’ve actually seen it can be difficult: you don’t know what you are going to find or what you might do with the data until you start interacting with it. And getting authorization, getting your hands on the data, and certainly getting it into a form that allows you to think about your community can take more time than you might expect or than you might think it’s worth. In fact, having access to the data will almost always raise more questions than it answers and could even be misleading, so this first step is one where it’s easy to give up.
Figure 6 – Job descriptions
The member dataset in the Ning site has a column labeled “Occupation/Title”. When they register, people can describe themselves using whatever terms make sense to them. Putting these descriptors in http://www.wordle.net reveals variation in punctuation and capitalization which we can standardize with a text editor (in my case, http://www.textpad.com/). A wordle is a nice way to visualize the several words that go with the main ones, “knowledge” and “management.” The next step with this bit of data might be to standardize all the job titles so as to group community members and, possibly, understand what those differences might mean.
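For anyone who would rather script the clean-up than do it in a text editor, here is a minimal sketch in Python. It is not what was done for the report; the function names and the sample titles are made up for illustration, and the only assumption carried over from the post is that the variation is in punctuation and capitalization.

```python
import re
from collections import Counter


def normalize_title(title: str) -> str:
    """Lowercase a free-text title and turn punctuation into spaces."""
    lowered = title.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse the runs of whitespace left behind by removed punctuation.
    return " ".join(no_punct.split())


def word_counts(titles):
    """Count how often each word appears across all normalized titles."""
    counts = Counter()
    for title in titles:
        counts.update(normalize_title(title).split())
    return counts


# Hypothetical "Occupation/Title" entries, not taken from the real dataset.
titles = [
    "Knowledge Management Officer",
    "knowledge-management specialist",
    "KNOWLEDGE MANAGER,",
    "Consultant (Knowledge Sharing)",
]

counts = word_counts(titles)
for word, n in counts.most_common(3):
    print(word, n)
```

The word counts are essentially what a word cloud tool sizes its words by, so the same table could also feed a first rough grouping of job titles.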
Here is a simple display based on calculating a member’s age from the birth date they provided when registering on the Ning platform.
Figure 5 – age distribution
What is striking to me about the age distribution (apart from the several cases where the age calculation results in a number greater than 90) is how very young KM4Dev membership is. The next step might be to break the dataset down into age bands to see whether there are any patterns of participation that vary with age. For example, do older members post to the Dgroup discussion more frequently than the young ones?
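The age calculation and the age bands suggested as a next step can be sketched in a few lines of Python. This is an illustration under assumptions, not the procedure used for Figure 5: the birth dates are invented, the ten-year band width is arbitrary, and ages above 90 are simply set aside as suspect rather than corrected.

```python
from collections import Counter
from datetime import date


def age_on(birth: date, today: date) -> int:
    """Age in whole years on the given day."""
    years = today.year - birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1
    return years


def age_band(age: int, width: int = 10) -> str:
    """Label such as '20-29' for the band that contains the age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def band_counts(birth_dates, today, max_plausible=90):
    """Count members per age band; implausible ages are tallied separately."""
    bands = Counter()
    suspect = 0
    for birth in birth_dates:
        age = age_on(birth, today)
        if age < 0 or age > max_plausible:
            suspect += 1
            continue
        bands[age_band(age)] += 1
    return bands, suspect


# Hypothetical birth dates; the last one mimics a bogus registration entry.
births = [date(1985, 3, 1), date(1979, 12, 25), date(1990, 11, 22), date(1900, 1, 1)]
bands, suspect = band_counts(births, today=date(2013, 11, 22))
print(dict(bands), "suspect:", suspect)
```

Once each member has a band, comparing participation across bands is a matter of joining these labels to posting counts from another platform, which is where the multi-platform problem comes back in.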
The final example in this post involved quite a bit of effort to produce a simple result, showing the percentage of members from the different types of organization:
Table 1 – Organizational affiliation
- Government agency/Bilateral,INGO
In these three samples, you can see that some people chose just one category while others chose many (up to 5), and the categories were recorded in different orders. In addition, the categories were sometimes separated by a comma and sometimes by a vertical bar character (“|”). Using a wonderful and free data cleaning tool named OpenRefine, it takes six steps to count two people in the INGO category in the previous three examples, and one half person each in Academic/Research and Government agency/Bilateral. I also collapsed the counts for NGO and INGO. OpenRefine does a great job of saving the code that you construct interactively and it also keeps a step-by-step description of what you’ve done. Here is the partly-intelligible description of the six steps:
- Create column n-cats at index 1 based on column Row Labels using expression grel:value.split(/,|//).length()
- Split column Row Labels by separator
- Create column weighted-once at index 13 based on column once using expression grel:value / cells["n-cats"].value
- Create column weighted-many at index 15 based on column many using expression grel:value / cells["n-cats"].value
- Transpose cells in 11 column(s) starting with Row Labels 1 into rows in two new columns named column-key and column-value
- Remove column column-key
There were some additional steps in Excel that were involved in producing this table, but you can see how a considerable amount of complexity is involved in producing something simpler to look at and easier to think about.
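The weighting idea behind those steps (each person counts as one, split evenly across the categories they chose) can also be sketched in Python. This is not the OpenRefine recipe itself, just the same arithmetic under stated assumptions: the separators are taken to be a comma or a vertical bar, the function and variable names are made up, and two of the three sample rows are hypothetical, chosen so that the totals match the counts described above.

```python
import re
from collections import defaultdict

# Assumption: categories are separated by a comma or a vertical bar.
SEPARATORS = re.compile(r"[,|]")


def fractional_counts(responses, merge=None):
    """Give each respondent a total weight of 1, split across their categories.

    `merge` optionally maps a category name to the name it is collapsed into.
    """
    merge = merge or {}
    totals = defaultdict(float)
    for response in responses:
        cats = [c.strip() for c in SEPARATORS.split(response) if c.strip()]
        if not cats:
            continue  # skip blank answers
        weight = 1.0 / len(cats)
        for cat in cats:
            totals[merge.get(cat, cat)] += weight
    return dict(totals)


responses = [
    "Government agency/Bilateral,INGO",  # the sample row shown in Table 1
    "INGO",                              # hypothetical row
    "Academic/Research|INGO",            # hypothetical row
]

counts = fractional_counts(responses)
total = sum(counts.values())
for cat, n in sorted(counts.items()):
    print(f"{cat}: {n:.1f} ({100 * n / total:.0f}%)")
```

Collapsing NGO and INGO would be a matter of passing something like merge={"NGO": "NGO/INGO", "INGO": "NGO/INGO"}; the merged label is an invented name.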
What I see in this table is a very productive diversity of organizations represented. That diversity is a source of fruitful dialog and, at the same time, a challenge: KM4Dev probably has a different value proposition for people from different types of organizations, and people from those different types of organizations are able to make different kinds of contributions to the community. Garnering support for infrastructure or other necessary investments (like the occasional bout of data analysis) from such different kinds of organization is likely to be very tricky. A further step would be to compare the amount and kind of contributions that come from people in those different categories. It would also be interesting to look for common characteristics among people who combine categories in different ways (e.g., one foot in academia and one foot in any of the several other categories of organization).