R and Google spreadsheets are powerful partners for verification, exploration, and delivery of data to people in communities and organizations. Recently I wrote a convenience wrapper around the gs_upload function in Jenny Bryan’s wonderful googlesheets package. But it takes some context to explain how I use it the way I do and why I think it’s important, so this post has a little bit of R code and a lot of context. It expands on a Lightning Talk I gave at the Portland R User Group last Wednesday.
In general I’ve found that getting tables or data frames from R into a document is fairly clumsy. There are several packages designed for the purpose, but to me they seem like more trouble than they are worth: often they are easy on the R side, but working with the output on the word processing side ends up being laborious. However, once a data frame is uploaded to a Google spreadsheet, cutting and pasting some or all of it into a document is easy and fast, and it doesn’t add junk that you have to remove later on. Apart from that, a Google spreadsheet is sometimes even more handy than RStudio’s data viewer (which is saying a lot) for inspecting and pondering what’s actually going on in a data frame.
But the real reason for the small bit of R code that I am sharing is that it helps with the verification, exploration, delivery, and use of data in a social context. A Google spreadsheet makes it easy to control access to your data: you simply grant someone access and send them a URL. It’s very easy for two people to look at it together, say, on a phone call. One or more people can annotate, sort, filter, plot, or pivot the data in it to their heart’s content (depending on their level of skill). If the conversation involves annotating it, a Google spreadsheet is nice because you know that everyone is looking at the same version. In addition, it actually works as a delivery method in that the data can then be downloaded into whatever environment someone wants.
The context of data and of analysis
But back to the subject of context. Back in the 1970s I read a lot of books and articles by statistical visionary John Tukey. (My copy of EDA is one of my most thumbed, battered, and annotated books.) I can’t find the exact quote, but I’m sure that somewhere Tukey said that “Domain knowledge is essential for data analysis.” I found the statement particularly troubling because I didn’t think I had any (certainly not enough) domain knowledge. I wanted techniques for data analysis that would give me insight in domains where I was pretty sure I was clueless. I wanted to be able to analyze data without knowing much about the context. Disappointingly, it turns out that, despite what you might think judging by most of the discussion about them, statistical techniques and software are no context-free silver bullet! In retrospect I was learning to appreciate the importance of domain knowledge when, in those years, I would always take a sick day to read an entire SUGI proceedings the day after it arrived — in part to check out what contexts framed the work of other SAS users.
As I think about that issue decades later, it seems to me that domain knowledge is indeed involved in at least these facets of sense-making with data:
- Data collection – where and how the data was collected and what it means
- Coding – how it was recorded and coded (and why and how to decode it)
- Retrieval – what kinds of ethical and technical issues there are around getting and using the data
- Munging – the pain and pleasure of cleaning, combining, and organizing the data for use
- Exploration – having some intuition as to where to look and how to look at it
- Analysis – understanding what kind of data reduction or analysis is relevant or customary
- Value assessment – judging the value of the data and the results of an analytical effort
- Communication for action – communicating the results to people who can take action
These contextual facets (I just made up or unconsciously stole this list, and would love to hear about yours) seem very important to me. Part of my motivation here is to argue that we need to bring more context into R Group discussions and presentations. In Why Information Grows: The Evolution of Order, from Atoms to Economies, César Hidalgo (2015) points to this same issue:
It is hard for us humans to separate information from meaning because we cannot help interpreting messages. We infuse messages with meaning automatically, fooling ourselves to believe that the meaning of a message is carried in the message. But it is not. This is only an illusion. Meaning is derived from context and prior knowledge.
Although we manipulate information with R, what we care about and what we seek are the messages that it encodes. Therefore we always need to be aware of context, drawing on whatever domain knowledge we have that is, consciously or not, turning our information into messages.
Communities and their technologies shape data and message
Slow forward 35 or 40 years and I do have some domain knowledge about stewarding technology for communities. How communities, organizations, and technology all interact has been a long-term interest: their interaction is important and determines what data the community might have or need about itself. I think each item on my list of sense-making facets interacts with issues of community, organization, and technology.
In the following diagram, from a blog post two years ago, the red lines represent organizational boundaries and the ochre lines suggest community boundaries. I’m still fascinated by the two examples on the far right of this diagram, where an organization (linked “personbytes” of knowledge, in Hidalgo’s terms) is present and plays a role but does not contain the community:
Although the landscape changes constantly, combining technologies, understanding how they support memory practices, how they do or don’t work together, and how they support being and learning together remain big challenges for communities and for organizations — just as much as they were when we were writing Digital Habitats. Here are some examples of the data issues that pop up for free-standing communities that depend on technology for their existence but are not contained by an organization:
- The Portland R Users group is just fine using Meetup for chit-chat, scheduling, and sharing resources. A nice meeting room like the one that Simple has provided is a key resource. And of course people’s willingness to give talks is what keeps it alive. Although members may be affiliated with an organization, the community’s “organization” is the determination of one individual who keeps the conversation going (and brings pizza!). Meetup’s clever “Good to see you!” follow-up emails accomplish an individual purpose (recognizing and greeting other people) at the same time that they gather data about the community: attendance and social network information. The data on a participant’s ratings of a session or on who they greeted is available through an API and may be used by Meetup Inc., but it is not readily available to the community itself. The community’s “organization” is supplied by Meetup.
- An open source project like OpenRefine relies on GitHub as its front door: to manage its code, its binary downloads, and its documentation wiki. An email list, a custom search engine covering community blogs, and a Twitter hashtag complete the community’s basic technology infrastructure. Its “organization” basically consists of the list of contributors. Although that list is preserved, imagine how much community history was lost as the tool passed from its original creators, Metaweb Technologies, Inc., when they were acquired by Google, which then spun it off as a community-supported product. Communities are often capable of holding long histories, but it’s not automatic, nor necessarily supported by community infrastructure.
- KM4Dev sprawls over an email list, a wiki, a Ning site, a G+ community, a hashtag, group meetings, Skype group meetings, and other platforms and venues that are unknowable unless you were there or heard about them from someone who was. Its “organization” is constantly in question, but it has managed to survive a long time, obtains funding to study itself, and accepts donations even though it isn’t a formal entity. Each of its platforms has a different way to keep track of a member and her activity, so data integration is very difficult. That means the community can’t use its data to argue to the big employers where its members work that KM4Dev is a key part of their professional infrastructure. It may be that its anti-organization stance, which is reflected in the loose coupling between its tools, is a response to the over-organization of large development bureaucracies.
These three free-standing communities are viable and productive with a minimum of organizational structure. Their data resources mostly serve their needs and are in alignment with community energy. However, as communities grow in size, complexity, ambition, or age, they need some kind of organization at their center (as depicted on the right in the diagram above). The churches, temples, and meditation centers that I’ve been studying and working with over the last 5 years all eventually need some kind of organization to carry out administrative functions on behalf of their communities. The question is always: how much “organization” is enough? — or too much?
Collecting data and using it depends on the organized activities that typically happen in an organization. The arduous task of collecting complex data and using it for diverse purposes, across time, across platforms, and across diverse social contexts requires even more organization. But what is depicted, what the messages are about, is often community participation — the voluntary and more chaotic side of life that can’t be captured. The question today is: how much data resource is enough? — or too much?
To get at the messages in the information that one of these organizations keeps, we need to remember that the organization and the community jointly frame data gathering, storage, integration, use, and meaning. To understand data issues, we have to consider questions such as:
- Balance between community and organization: are the ways that one serves the other effective and well-understood? are parts of the community or the organization more important (or out of reach)?
- Life span and length of memory: how important is it to remember participation, adherence or contribution? how far back does history go? how much change is required to “stay the same”?
- Social context and diversity: what locations, languages, or different purposes are represented? how consistently are messages encoded and decoded?
- Technology dependency, diversity and integration: what parts of the organization or community’s life need to take place on technology owned by the organization itself versus platforms like Facebook or LinkedIn? how spread out over multiple technologies is the community and how important is integration?
These questions might sound ponderous if we’re just talking about one query or data project but I think they emerge when we do more with data resources in a community-related organization. We need to deal with all the traditional organizational issues as well as the kind of sense-making issues that communities are always engaged in.
From my R data frame to a distributed Google spreadsheet
Moving information from R to a Google spreadsheet is fairly straightforward. Taking care to transfer the messages requires some extra steps. For example, what’s a convenient, clear, and consistent name for an object in my R code is not necessarily helpful when delivered to someone else. Here are some changes I make to names as I upload an R data frame to a Google spreadsheet:
- A terse data frame name becomes a longer and more descriptive spreadsheet name
- Variable names are expanded to be more descriptive column headers
- I never use capitals in variable names, but I find that they make column headers easier to read
- I replace underscores and dots in variable names with spaces, so that column headers consist of words that easily flow into more than one line
Here is an example where I upload a small data frame to a Google spreadsheet.
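What follows is a minimal sketch of that kind of wrapper rather than my production code: the function name upload_df, its arguments, and the sample data are all illustrative, and it assumes you have already authenticated with gs_auth().

```r
library(googlesheets)   # Jenny Bryan’s package; run gs_auth() once per session

# Sketch of a convenience wrapper around gs_upload() (illustrative, not my
# actual code): swap terse variable names for readable headers, write the
# data frame to a temporary CSV, and hand that file to gs_upload().
upload_df <- function(df, sheet_title, header_names = NULL) {
  if (!is.null(header_names)) {
    names(df) <- header_names                   # descriptive column headers
  } else {
    names(df) <- gsub("[._]", " ", names(df))   # underscores and dots become spaces
  }
  tmp <- file.path(tempdir(), paste0(sheet_title, ".csv"))
  write.csv(df, tmp, row.names = FALSE)
  gs_upload(tmp, sheet_title = sheet_title)     # returns a registered googlesheet
}

# A terse data frame becomes a descriptively named spreadsheet
ctr_mbr <- data.frame(ctr_name = c("Portland", "Eugene"),
                      n_mbr    = c(120, 45))
ss <- upload_df(ctr_mbr, "Center membership counts",
                header_names = c("Center name", "Number of members"))
```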
Here’s a snapshot of the resulting Google spreadsheet:
Once the data frame is in a Google spreadsheet it’s helpful (and very easy) to:
- Freeze the first row, so that column labels don’t disappear when you scroll down
- Bold the first row, so it stands out clearly
- Center and flow the text in the first row, so that longer column headers aren’t cut off
- Set the width and formatting of each column appropriately (e.g., set decimal places)
- Turn “filter” on to allow subsetting at the click of a button
- Set specific rows or columns to a different color when you want to call attention to a particular issue
Getting to the community’s message
So what domain knowledge is relevant to data about, by, and for a community with an organization at its center? Despite years accumulating domain knowledge about communities, organizations, and data analysis, there is a lot that I don’t know about the creation and use of the data I’m interested in. Working on behalf of some 250 centers of different sizes, nationalities, and levels of maturity around the globe means that even narrowing my focus to one database, there is not one context but many. On the data creation side, I find that there are different data entry practices, and the volunteers who enter the data turn over regularly; learning about the data and its context is an ongoing process for new volunteers and therefore for me. Despite common intentions, many inconsistencies and blind spots aren’t visible until people can see the results of their work in a larger or comparative context — like a handy Google spreadsheet.
When a data frame is a report that involves greater complexity than a simple list, it requires additional explanation, such as I suggest in the following example, where hints and suggestions expand on column-by-column documentation.

Although some of the data frames I upload to Google spreadsheets are single-use, look-once, copy-once, and throw-away, some of them are longer-lived. When R is joining information from many different sources (e.g., MySQL, Google Analytics, MailChimp, web scraping, etc.) or is replicating a report many times over, a complete description of the data and its context is worth the time.
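For those longer-lived reports, the documentation can travel with the spreadsheet itself as a separate worksheet. Here is a sketch of that idea, building on the illustrative upload_df() example above; the worksheet title and the column descriptions are made up for the example.

```r
# Sketch: add a documentation worksheet next to the data, using the ss
# object from the earlier example (worksheet title and descriptions are
# illustrative)
doc <- data.frame(
  `Column header`         = c("Center name", "Number of members"),
  `Hints and suggestions` = c("Official name as recorded in the database",
                              "Active members only; entered monthly by volunteers"),
  check.names = FALSE
)
ss <- gs_ws_new(ss, ws_title = "About this data",
                row_extent = nrow(doc) + 1, col_extent = ncol(doc))
gs_edit_cells(ss, ws = "About this data", input = doc,
              anchor = "A1", col_names = TRUE)
```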
But nothing I write about the data is the last word. Eliciting knowledge from sense-making partners in a community and its organization is a key step in making the data resource useful. A Google spreadsheet seems like an ideal vehicle for negotiating and understanding the different assumptions and meanings that transform the information that I have into a meaningful message for my partners. I find that to them “information” is boring, but messages about people, processes and possibilities are interesting because they can lead to growth and benefit.