Dec 01 2013

Mapping a community – easy and not-so-easy

Published at 7:07 pm under code, community data, Technology

I’m resonating with how Joitske Hulsebosch has organized “Tools for social network analysis from beginners to advanced levels.”  It’s always safer when you can start at “the shallow end of the pool” and get into deeper and deeper water as you go.  In diving into the real-life river of data, we may not know just how deep things are — and only discover the shallow end (meaning “easier”) methods later! This is the second in a series of posts about my attempt to hold a mirror up to KM4Dev in a project funded by a grant from IFAD.

As I mentioned in my previous post, sometimes there is a ready-made visualization, if we can turn it to our current purpose.  Figure 4 in my Phase One Report is almost lifted straight out of Dgroups – the host for KM4Dev’s email discussion list. Here is a world map showing where KM4Dev members are from:

dgroups-membership-map-oct2013-raw

If you are a member of KM4Dev, you can have a look for yourself, since membership changes and the screenshot above is now out of date:  https://dgroups.org/groups/km4dev-l/members/overview

However, I found that map hard to read, even though it’s a nifty way of looking at the community’s geographic distribution.  So I used Snagit to capture it and make it more legible for inclusion in the report.  Snagit has a color substitution tool to make the grey and the green darker:

Fig-4-dgroups-membership-map-oct2013

 

To me, the number of countries and people registered from “the developing South” is one of the indicators of KM4Dev’s value – it brings practitioners from all over the world into an ongoing conversation. The Dgroups platform has other depictions but this is the most vivid one.  Increasing the color contrast is very easy and it’s a big improvement.

The next example is a good bit more involved, using the wonderful and free data cleaning tool OpenRefine to standardize the names of the countries people give when they register on the KM4Dev Ning site.  One of OpenRefine’s most remarkable features is its ability to compare and reconcile values in your dataset to a public source of “standard names”, in this case country names:

Original Value           Modified                               Count
Armênia                  Armenia                                1
Côte d’Ivoire            Cote d Ivoire                          3
China                    China, mainland                        6
Colômbia                 Colombia                               39
Congo                    Congo, Republic of the                 1
                         Congo, The Democratic Republic Of The  2
Laos                     Lao Peoples Democratic Republic        3
Moldova                  Moldova, Republic of                   1
Palestinian territories  Palestinian Territory, Occupied        3
Russia                   Russian Federation                     7
South Korea              Korea, Republic of                     6
Syria                    Syrian Arab Republic                   5
Tanzania                 Tanzania, United Republic Of           13

When you have almost 3,500 rows (one member per row), a tool like OpenRefine is pretty important. This was my first time using OpenRefine’s reconciliation process; it felt like I was stumbling a bit, so I’m not going to try to describe it.  But once the country names have been standardized, a pivot table (in Excel or Google Spreadsheets) gives you the counts by country.  Google Spreadsheets has a nifty tool to chart countries on a map.  I presumed that standard country names were required.
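For readers who prefer a script to a pivot table, the counting step can be sketched in Python.  This is only a minimal sketch and not the OpenRefine reconciliation itself: the `STANDARD_NAMES` lookup is a hand-built stand-in that borrows a few pairs from the table above, and the `registrations` list is made up for illustration.

```python
from collections import Counter

# Hand-built lookup (a stand-in for OpenRefine's reconciliation step).
# The pairs are copied from the table of original and modified values.
STANDARD_NAMES = {
    "Armênia": "Armenia",
    "Colômbia": "Colombia",
    "Russia": "Russian Federation",
    "Syria": "Syrian Arab Republic",
}


def standardize(name):
    """Return the standard country name, or the trimmed input if unknown."""
    cleaned = name.strip()
    return STANDARD_NAMES.get(cleaned, cleaned)


def count_by_country(names):
    """Count members per standardized country (the pivot-table step)."""
    return Counter(standardize(n) for n in names)


# Hypothetical sample rows, one per registered member.
registrations = ["Colômbia", "Colombia", " Russia", "Kenya", "Armênia"]
counts = count_by_country(registrations)
for country, n in sorted(counts.items()):
    print(country, n)
```

Names that are not in the lookup pass through unchanged, so a spelling the lookup has not seen before would still show up as its own row in the counts.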

geo-location-based-on-ning-dataset

The dataset and this map are available.  Larger and darker dots represent a greater number of registered members and I’ve made the ocean bluer than Google’s version. A bit of playing around with the mapping tool suggested that mapping the logarithm of the count of members was more useful than mapping the raw count.  Note that with this many countries in the table, the map takes a while to render.  Until I figured out it was just slow, I was worried that I had somehow dropped the counts for the United States, which has a lot of members but is alphabetically toward the end.
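The log transform mentioned above can be sketched as follows.  The counts here are hypothetical round numbers chosen only to show the effect, and the choice of base 10 is my assumption; the post does not say which base was used.

```python
import math


def log_scale(counts):
    """Map each country's member count to log10(count), skipping zero counts."""
    return {country: math.log10(n) for country, n in counts.items() if n > 0}


# Hypothetical counts, not taken from the KM4Dev dataset.
member_counts = {"Country A": 1000, "Country B": 100, "Country C": 10, "Country D": 1}
scaled = log_scale(member_counts)
for country, value in scaled.items():
    print(f"{country}: raw={member_counts[country]}, log10={value:.1f}")
```

With raw counts, the largest country is a thousand times the smallest; after the transform the values are evenly spaced, which is why small countries stay visible on the map.  Note that a country with a single member maps to zero.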

Although this map takes a lot more effort to produce than the Dgroups example, it doesn’t suggest a radically different distribution of KM4Dev members around the world, even though it depicts data from a completely separate registration process (people register on Dgroups but not Ning, and vice-versa).  However, there’s nothing like having the detail data in hand to permit further analysis.  I did not have time to do so, but several possibilities spring to mind immediately:

  • Amount of activity: are people in all countries just as active?
  • Recency: how has Ning registration spread over time?
  • Role: are leadership roles spread as evenly as membership is?

It might be hard to stop.  So many questions, so little time.

The last example looks at yet another dataset that represents the KM4Dev community, this time a survey of members reported in the KM4Dev Baseline L&M Survey 2013. The authors of the survey were kind enough to give me a spreadsheet from Survey Monkey.  I was wondering whether survey respondents were as widely distributed geographically as KM4Dev members (as suggested in the other maps above).  Let’s start with what I found and look at the details afterward.  The survey did not ask for any details that would identify an individual respondent, so cross-referencing of any sort was not possible.  However, I used the IP address that Survey Monkey collects to give me some assurance that the survey includes people from around the world.  Here are the results in tabular and map form:

Many responses per country: United States: 29; United Kingdom: 17; Netherlands: 10; Switzerland: 7; Canada: 6; Ethiopia: 6; India: 5.
Four responses per country: Colombia, Lithuania, and Uganda.
Three responses per country: Brazil, Nepal, South Africa, and Spain.
Two responses per country: Bangladesh, Belgium, Costa Rica, Ecuador, France, Germany, Kenya, Malaysia, Mexico, Nigeria, and Philippines.
One response per country: Australia, Botswana, Burkina Faso, Chile, Denmark, Djibouti, Fiji, Indonesia, Jordan, Pakistan, Paraguay, Peru, Senegal, Trinidad and Tobago, and Tunisia.
Three responses had blank IP addresses so no country information is available.

This map was produced by the Google Spreadsheet map chart tool using the same data:
L&M-survey-respondents-2013-location

That seems like a nice footnote to a good study: geographic distribution makes us trust the survey’s conclusions more.  But this last example is also a reminder that mapping in particular and data analysis in general can take a lot of effort to gain modest insights.  Still, you don’t know until you look, so you ought to look.

Specifically I was interested in whether people’s satisfaction with KM4Dev’s several platforms had a geographic pattern.  Does lack of bandwidth affect satisfaction with more bandwidth-intensive means of communication (e.g., Ning compared to Dgroups)? The received wisdom is that email is more accessible when you are in a low-bandwidth situation or where Internet access tends to be intermittent.  Here’s the final table I arrived at which compares “South” and “North” in terms of “satisfaction with the Ning platform.” “North” here includes Europe, North America and Australia, where roughly speaking we can assume higher and more stable bandwidth than what is found in other countries (here, following European custom, labeled “the South”):

Response       South  North  South %  North %
Very much          1      4       3%       7%
Much               2      9       5%      16%
Sufficiently      12     11      30%      20%
Not very much     15     21      38%      38%
Not at all        10     10      25%      18%
Total             40     55     100%     100%

I don’t see a systematic difference between the South and the North columns, do you?
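If you want to check the percentage columns yourself, they can be recomputed from the counts.  This is a minimal sketch; the counts are copied from the table, and the variable and function names are mine.

```python
RATINGS = ["Very much", "Much", "Sufficiently", "Not very much", "Not at all"]
SOUTH = [1, 2, 12, 15, 10]   # counts from the "South" column
NORTH = [4, 9, 11, 21, 10]   # counts from the "North" column


def column_percentages(counts):
    """Express each count as a percentage of its column total."""
    total = sum(counts)
    return [100 * c / total for c in counts]


south_pct = column_percentages(SOUTH)
north_pct = column_percentages(NORTH)
for rating, s, n in zip(RATINGS, south_pct, north_pct):
    print(f"{rating:<14} South {s:5.1f}%   North {n:5.1f}%")
```

Because the table shows whole-number percentages, the rounded figures in a column may add up to 99% or 101% rather than exactly 100%.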

Of course this table is more of a provocation than “an answer” because of who chose to respond to the survey and many other factors.  For example, an IP address tells us where respondents were when they completed the survey, not where they were when they formed their opinions about usability or satisfaction.  It is interesting that one of the more active groups on the KM4Dev Ning site is “KM4Dev for Africa”!  It seems meaningful to me that the membership of KM4Dev is spread around and members are in (or come from) so many different countries. But the fact that opinions (and possibly behaviors) do not seem to be correlated with geography in this case is a reminder that people are people, and that contribution and participation are happening all over the world. There are probably more differences with regard to knowledge management, knowledge sharing, and access to the Internet within any given country than there are between countries.

I have found that it can take some thought to turn the Excel spreadsheet that Survey Monkey generates into something that is easy to use, but it’s not too difficult.  But how can you get a location from an IP (or “Internet Protocol”) address?  Well, you guessed it: OpenRefine to the rescue. Once you have simplified Survey Monkey’s spreadsheet, you can import it into OpenRefine and look up each IP address to find a location.  OpenRefine can combine a string such as http://freegeoip.net/json/ with another string such as 173.194.33.128 (which happens to be the IP address of “Google.com”) to form a URL such as the following for every row in an OpenRefine “project” (in our case, 144 responses to the survey):

http://freegeoip.net/json/173.194.33.128
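Outside OpenRefine, the same string-joining step might look like this in Python.  It is a sketch that only builds the lookup URLs and does not fetch them; the function name is mine, and I have not checked whether the service still answers at this address.

```python
# Base address as given in the post.
BASE_URL = "http://freegeoip.net/json/"


def lookup_url(ip_address):
    """Join the base address and one IP address into a lookup URL."""
    return BASE_URL + ip_address.strip()


# One IP address per survey response; a single example row here.
ips = ["173.194.33.128"]
urls = [lookup_url(ip) for ip in ips]
print(urls[0])
```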

OpenRefine has a nifty command to create a new column with the results of retrieving a URL in an existing column.  When you do that with the URL above, you get geographic details for Google in JSON format:

{"ip":"173.194.33.128","country_code":"US","country_name":"United States","region_code":"CA","region_name":"California","city":"Mountain View","zipcode":"94043","latitude":37.4192,"longitude":-122.0574,"metro_code":"807","areacode":"650"}

Freegeoip.net can return the same information in other formats and will show you your IP number.  Once Freegeoip has done its work, OpenRefine makes it easy for you to create new columns to split out the details into separate columns with codes such as value.parseJson().country_name
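For comparison, here is roughly what that extraction step does, sketched in Python with the standard `json` module.  The sample response is the one shown above; the `extract_fields` helper is a name I made up for illustration.

```python
import json

# The JSON text returned for Google's IP address, copied from the post.
SAMPLE_RESPONSE = (
    '{"ip":"173.194.33.128","country_code":"US","country_name":"United States",'
    '"region_code":"CA","region_name":"California","city":"Mountain View",'
    '"zipcode":"94043","latitude":37.4192,"longitude":-122.0574,'
    '"metro_code":"807","areacode":"650"}'
)


def extract_fields(raw_json, fields=("country_name", "latitude", "longitude")):
    """Parse one response and keep only the requested fields."""
    record = json.loads(raw_json)
    return {field: record.get(field) for field in fields}


details = extract_fields(SAMPLE_RESPONSE)
print(details)
```

Each key in the result corresponds to one new column in the OpenRefine project.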

Similar commands are used to extract longitude and latitude, etc.  Finally, latitude and longitude are handy both for grouping countries into continents or other aggregations (as above, where I grouped countries into “North” and “South”) and for double-checking your work, so that “Belgium” doesn’t end up in the “Developing South” group:

 Open-Refine-facets
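The same kind of double-check can be sketched in code.  Everything specific here is an assumption for illustration: the short `NORTH_COUNTRIES` list is not my full grouping, the 45-degree latitude threshold is arbitrary, and the latitudes for Uganda and Belgium are approximate.

```python
# Illustrative subset only; the actual grouping covered Europe,
# North America and Australia.
NORTH_COUNTRIES = {"United States", "United Kingdom", "Netherlands", "Belgium", "Australia"}


def region_for(country_name):
    """Assign a country to "North" or "South" using the lookup set."""
    return "North" if country_name in NORTH_COUNTRIES else "South"


def suspicious_rows(rows, min_latitude=45.0):
    """Flag rows labeled South whose latitude is far to the north.

    The threshold is a hypothetical value; it is a crude check, not a rule.
    """
    return [r for r in rows if r["region"] == "South" and r["latitude"] > min_latitude]


rows = [
    {"country": "United States", "latitude": 37.4192, "region": region_for("United States")},
    {"country": "Uganda", "latitude": 0.35, "region": region_for("Uganda")},
    # Deliberately mislabeled to show what the check catches.
    {"country": "Belgium", "latitude": 50.85, "region": "South"},
]
flagged = suspicious_rows(rows)
print([r["country"] for r in flagged])
```

A latitude check like this only catches one kind of mistake.  Australia, for example, sits in the southern hemisphere but belongs to the “North” group, so the lookup list still has to be reviewed by eye.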

OpenRefine advertises itself as a tool for working with “messy data” and data about communities will almost always be messy!  It’s worth learning how to use.

2 Responses to “Mapping a community – easy and not-so-easy”

  1. Nancy White says:

    SUPER interesting John, and the maps confirm in a very general way my hunch that we are more globally represented than we sometimes think. It would be interesting to look at those geographic numbers with some relationship to population of the countries. Or if we had the data, vs potential KM4Dev membership/audience in a particular country but that is just dreaming.

    I don't think we seek simply representation, but connection to people who share interest in the domain of knowledge for development. And that is a wide net.

    Cool, geeky stuff!

    • smithjd8 says:

      Thanks, Nancy. I think you put your finger on the main point, which is: "So what?" What actions do we take? I think one response from my perspective is: "Take KM4Dev seriously — it really is very diverse." That may seem very slight, but it could have a lot of small consequences that add up. It's not just "some mailing list."

      If population would be an interesting denominator (e.g., "KM4Dev members divided by population = country representation") then as you say, "KM population" would be another one. But so would dollars of aid invested or spent. What about foreign direct investment or remittances from abroad?