Learning Alliances

Dec 05 2018

Community and organization intertwine

Published by John David Smith under Communities of practice,Stories

I’ve been thinking about how community and organization are intertwined, especially when they are interdependent like they are in churches, synagogues, or mission-driven organizations like Amnesty International. The formation of a process team that’s focused on governance in Shambhala prompts me to write some of my thoughts down.

Community and organization are different social entities that represent different ways of participating in our human world. Each can do things that the other can’t. For example, organizations can own assets like websites and other technologies, run payrolls, be bound by law, and be held clearly accountable. Communities, on the other hand, can be informal and don’t even exist unless we participate in them, but they are personal and meaningful in ways that organizations rarely are.
The two interact and we switch back and forth between an organizational and community view often without noticing. I’ve thought about the interactions a lot and I get confused. Your mileage will vary.
The most important point is that community and organization can support and augment each other, or can hobble each other. Therefore it’s worth thinking about their interactions and how they are intertwined.

In his book on church governance, Dan Hotchkiss gives a simple and compelling argument for how organization and community interact:

The most important factor in deciding how to organize a congregation for decision-making is its size because no fact about a group of human beings says more about it than its size.
Dan Hotchkiss, Governance and Ministry: Rethinking Board Leadership (Lanham, MD: Rowman & Littlefield Pub Inc, 2016), p 99.

In the following table, I lay out some contrasts, describing how each side answers a general question like, “What is it?” In each cell I add in italics what that side can do to support the other. Afterwards I give some examples where community and organization seem to harm each other.

	Participating with an organizational perspective	Participating with a community perspective
What is it?	An organization is a recognized legal person that has formal rules of operation. Can provide venues and infrastructure for community gathering.	A community is a history of clustered relationships and events. Community interactions can keep alive the memories and values that make organizations honest — and valuable to society.
What resources are required?	Organizations depend on a larger social or economic system, laws, money, and produce “outcomes”. Organization can scale or extend community.	Communities depend on the larger society — a fabric of social relationships. Community can validate organization life in (local) practice.
How are roles defined?	Organizational roles are contracted or appointed. Can recognize & formalize community values & provide focus.	Community roles evolve, are negotiated informally over time, are based on participation. Community provides a reservoir of committed talent; members return to community after serving the organization.
What separates the inside from the outside?	Can buy or sell assets; can sell or procure work externally. Can extend community reach or protect it from external threats.	Legitimate peripheral participation enables outsiders to experience community norms and values gradually. Can bring new vitality into an organization including people, ideas, and resources.
How does it visualize itself?	Organization charts and protocols codify relationships and power. Can simplify or close off intractable debates or conflicts that suck energy from the community.	Stories, events, memories are part of individual sense-making and are shared in community life. The community’s memory can keep an organization true to its purpose and its conversations can alert the organization to emerging needs.
How is communication organized?	Messages go through formal, legitimate channels. Can reduce the noise of community chatter and be purposeful about listening to widely separated perspectives.	Conversations are ad hoc, shaped by individual relationships, opportunity, and feeling of relevance. Can provide a grapevine that tells truths that illuminate organizational blind spots.
How does a collective “voice” express something to the world?	Formality enables “singing from the same song book.” Can gather a community’s message and broadcast it.	Shapes multiple, opposing voices into a dialog. Can add depth and breadth to an organization’s point of view.

Here are a few illustrative stories of negative interactions.

In a story about a young pastor who fired a church organist only to be fired himself, Dan Hotchkiss writes: “Informal networks kill silently, so it is not easy to retrace their steps. No doubt Gladys, like most church staff members, had a political constituency all her own. Her supporters did not speak up in the deliberations of the formal church — the first board meeting, where the focus was on her competence as organist. In that setting, it would have felt out of place to speak of personal affection or the fact that Gladys had provided music for hundreds of funerals and weddings and had woven herself deep into the fabric of the church’s life. But in the informal congregation, such considerations no doubt dominated the agenda. In this case it was the informal congregation whose priorities won out. Gemeinschaft is more important in small congregations than in large ones, but it never quite goes away!” — Dan Hotchkiss, 2016, pp 100-101.
Pedophiles in the Catholic church were bound to each other by codes of silence and enabled by an organization that provided a setting and cover for their activities. When their activities were exposed and the organization’s complicity in the cover-up was also exposed, the cost to the organization was enormous. We don’t know details of this story, but these elements must have been there and the costs were real.
When an organization must draw leaders from the community that surrounds it, recruitment can’t be just a formal process to fill leadership positions. In a leadership development project with Juan Carlos de la Puente 2 years ago, we found that Amnesty International’s leadership recruitment and development process had become too procedural and rule-bound, using only “organizational” logic. Their best leaders were deeply aligned with the AI community, but were also quite critical of organizational bureaucracy. We recommended that they treat leaders in Latin America as part of a community and become more purposeful about befriending prospective leaders to get to know them before proposing specific organizational roles.

If we participate in an organization or community that depends on the other way of participating, we need to alternate between the two perspectives. I don’t think there is a formula for balancing community and organization. You have to be there.


Jun 16 2018

cRaggy 2018: design, feedback & reflections

Published by John David Smith under Communities of practice,Conferences,Event design,R

This blog post describes the cRaggy event at the June 2, 2018 Cascadia R Conf, its design, the logic behind its design, feedback from participants and reflections on how such an event might be better in the future.

Here’s the pith of how we learn R: The R ecosystem is a marvel made up of a global cloud of people, their connections, their know-how, and their tools. Learning in this ecosystem involves choosing between specific pathways: a place and time, with certain people, using specific boundary objects — in some reasonable sequence of steps. The best instructional, event, or conference design results in increased excitement, inspiration, enjoyment, personal connections, and know-how. In a way, the event pieces that we string together to make up a conference are just like a bunch of R statements that aren’t useful till we put them together with human intention, skill, and passion.

Down on the ground cRaggy started with not much more than the half-baked idea depicted in this drawing:

From the beginning I was thinking of cRaggy as a sequence of steps strung together to structure individual and collective experience, along the lines of liberating structures, with learning as the main goal.

The cRaggy design process was itself a string of collaborations with the conference organizing committee. They helped by validating the original idea and by building on it to produce the final event. Chester Ismay and Ted Laderas, in particular, had lots of specific suggestions for datasets, a key element of the design. Chester mentioned that Andrew Bray’s students were doing a lot with local, civic data; and one of Andrew’s suggestions was the BIKETOWN dataset, which was the one we ended up using. Chester also put me in touch with Thomas Mock, who’s been running the Tidy Tuesday events. We borrowed a lot of ideas from Tidy Tuesday, and email exchanges with Thomas were very helpful in evolving the final design. Ted has written up some reflections about the overall conference design.

This year’s cRaggy event

We announced the cRaggy event in January, without very many specifics. As the conference approached, we published a set of instructions for participants, calling it the cRaggy gRaphics show-and-tell. Here is the super-simple form that people completed when they submitted their entry on the day of the conference:

cRaggy entries were all posted in one corner of the 360 person capacity room where the conference was held. The beer and food were served in the same corner at the end of the day. People could stand around discussing the entries during the whole day.

The authors of the three entries that received the most votes each gave a 5-minute lightning talk at the end of the day:

Pierrette Lo, Locations and numbers of rides ending outside a BIKETOWN hub
Charlotte Wickham, When do rides start?
Kevin Watanabe-Smith, AVERAGE BIKETOWN TRIPS; Median trajectory of weekday trips (2016-2018)

Design to Balance Opposing Factors

During the design phase and on the conference day, I was aware that “design with social learning in mind” meant balancing two opposing forces. This table suggests how those forces alternated, more or less in chronological order, as a kind of learning peristalsis.

Concentrate, constrain, narrow it down	Open up, expand, broadcast
Gather design ideas and suggestions from many people to build on a half-baked idea
	Announce the cRaggy event and then the rules early on
	Identify hundreds of possible datasets that would be interesting.
Select one dataset that was local, topical, accessible, and the right size	Dataset is highly “mergeable” with other datasets because it has “universal keys” (time and place)
Produce a minimal example demonstrating how to access the data	Example is important for lowering the barrier to entry
	Advertise the cRaggy dataset two weeks before the conference; encourage everyone to participate
Participants pose their own analytical question
Post entries in one corner before 9 am on conference day	Last minute entries are acceptable
	Entries have contact info, github link
	Entries posted near the food & beer
	Time in conference schedule to browse entries; everyone invited to vote
Each person has one vote to “hear more” about one entry
	Sticky notes and authors available to stimulate conversations
Three submitters contacted to give lightning talks
	Lightning talks at the end of the day to share backstory, dead ends, next steps
	Follow up on Twitter: #TidyTuesday

Overall feedback from conference participants

In the conference feedback questionnaire, several people said that cRaggy was their favorite part of the conference. Some said that the lightning talks they liked most were the cRaggy talks. One said, “I didn’t participate in cRaggy this year, BUT I LOVED IT! Please do it again!”

Feedback from cRaggy participants

I wrote to the twelve people who submitted an entry and got really thoughtful and interesting feedback from many of them.

Participants agreed that cRaggy was really fun. Sample comments were:

“It was a fun, no-pressure way to feel a bit more involved in the conference and see how other people approached the dataset.”

“I can’t think of anything more fun than exploring data and creating visualizations.”

Participants especially liked the BIKETOWN dataset because:

“[it] struck a wonderful balance of being interesting, big-but-not-too-big, in pretty good shape tidy-wise (but not perfect) and fun to explore.”

They liked the fact that the dataset had both dates and geolocation features, which made it “really easy to join up with other sets.”
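
To make the “universal keys” point concrete, here is a minimal sketch of joining trip records to a second dataset by date. Everything in it is made up for illustration: the column names (start_time, duration_min, rain_mm) and the values are my assumptions, not the actual BIKETOWN schema or real weather data.

```r
# Hypothetical toy data: column names and values are invented for this sketch,
# not taken from the real BIKETOWN files.
trips <- data.frame(
  start_time = as.POSIXct(c("2018-05-01 08:15:00", "2018-05-01 17:40:00",
                            "2018-05-02 09:05:00"), tz = "UTC"),
  duration_min = c(12, 25, 8)
)

# A second, equally hypothetical dataset keyed by calendar date.
weather <- data.frame(
  date = as.Date(c("2018-05-01", "2018-05-02")),
  rain_mm = c(0, 4.2)
)

# Derive the shared key (the date) from the timestamp, then join on it.
trips$date <- as.Date(trips$start_time, tz = "UTC")
trips_weather <- merge(trips, weather, by = "date", all.x = TRUE)

# Average trip duration per day, next to that day's rainfall.
aggregate(duration_min ~ date + rain_mm, data = trips_weather, FUN = mean)
```

The same pattern would apply to place: round or bin coordinates into a shared key (a hub, a neighborhood, a grid cell) and merge on that.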

Part of cRaggy’s value was that the dataset forced people to work outside of their usual professional domain. For example, two different respondents said,

“I work in anthropology, specifically archaeology, and so it was really fun to branch out to a very different kind of dataset that has time stamps in the minutes and not in the tens or hundreds of years.”

“I am a transportation professional and found myself overthinking what to do with the data set a lot [and that was good].”

One participant summarized it,

“… a big value in the event is exposing people to ideas beyond those directly relating to R code they might not come across otherwise.”

cRaggy was a way to encourage people to dive into the R ecosystem. One participant was impressed with

“… how helpful and active the R community is in Stack Overflow, GitHub, CRAN, Reddit, etc. In essence, I am super grateful of R’s passionate developers and user base (in real life and online).”

As a bit of an #rstats glutton, I was struck that one very interesting cRaggy entry was from someone who admitted that they weren’t even on Twitter! Talk about diversity!

Suggestions for next time

The original idea was to share and think about graphics, but clearly participants thought it could go further. They thought that cRaggy focused “more on presentation and communication than on coding and data analysis.” Ed Borasky put his finger on the fact that voting missed thoughtful examination of data problems that weren’t as recognizable as flashy graphics. He said

“I spent a *lot* of time cleaning the data. See http://rpubs.com/znmeb/biketown.”

Other suggestions included:

“It would be cool to easily see links to github repos from the other entrants.”

“Switch to a virtual format – the “paste on the wall” thing really doesn’t cut it.”

Charlotte Wickham had several interesting suggestions:

“It might also be nice to somehow celebrate the learning side of the event, i.e. each entrant must also provide a sentence describing something new they learnt or tried in the process of entering, that could be displayed independently of the actual entries.”

“I’d love to see some more support for those who might be on the edge of entering. I’m not sure what this might look like, but maybe a pre-conference hack event, a online forum (Slack or something), or just a few more people posting starts they’ve made or questions they’d like to answer. I’d imagine the primary focus would be on encouraging people to post something on the day regardless of where they get to.”

How can we keep the event approachable and comfortable for people across all sorts of skill levels?

We wanted cRaggy to result in the selection of people who would give a lightning talk, but participants thought that the voting could be improved.

“I would suggest that voting NOT be publicly presented via stickers. I would use a ballot box or online kind of voting system that’s anonymous to the voters and participants. As a social network analyst, I would posit that there seemed to be a preferential attachment (i.e. “rich get richer”) effect with the stickers.”

“Have more categories of winners, such as most creative, most artistic visual / graph, most last-minute (maybe), etc.”

I had thought of having different categories of votes, but never quite figured out the logistics. In the heat of the conference (after all I was a participant first and an organizer second!) I even forgot to record the number of votes that each entry received. Next time I would display the entry form in advance so that people would expect to provide additional information such as

How much time did you spend?
What was your question?
What packages did you use?
What did you learn?
What would you have done with more time?

Beyond that, the cRaggy idea could evolve by somehow mapping the steps people go through as participants to a model of the steps in a data analysis project, either Hadley Wickham’s model from R for Data Science or something along the lines of John Tukey’s (1982) “Introduction to styles of data analysis techniques” (PDF) that proposes stringing data analysis steps together along the lines of:


Feb 12 2018

Computing on R

Published by John David Smith under Communities of practice,R,resources,Technology

R is not just software. It’s actually a global organism that grows information: insights, discussions, work methods, human relationships, and open questions as well as a massive amount of software and all the resources that document or support it. Cesar Hidalgo equates growing information with “computing”, claiming, “It is the growth of information that unifies the emergence of life with the growth of economies, and the emergence of complexity with the origins of wealth.”

It’s a big deal that R code can compute on R code because R code is just as much a “first class thing” as the data we compute on. Both are first class objects, as Hadley Wickham points out in the rlang package documentation. But it’s an even bigger deal that the R organism computes on itself, as well. Hidalgo explores how social structures process information, “We form social structures to compensate for our limited capacities, and these social structures learn how to process information.” My argument in this post is that Hadley’s model for data analysis describes computation on data and computation on the R organism. Trying to point to instances of social structure helps organize observations of how information grew at the Rstudio::Conf in San Diego and how we are collectively learning how to process information.

You can observe a lot just by gathering tweets. I didn’t create any persistent information during the conference. I just soaked it all in, waiting to write up some reflections until I got on the plane home. Wanting to add detail, I downloaded Tweets with the rstudioconf hashtag. In that batch, I found two cool efforts to gather and analyze Tweets that were produced during the conference that are worth reading:

https://github.com/mkearney/rstudioconf_tweets (a comprehensive how-to post that includes sentiment analysis, network structure, etc.)
http://johnguerra.co/viz/influentials/RStudioConf2018/ (discusses “who’s influential?” You can guess who, but it’s always interesting to look at the data.)

When I looked at all the rstudioconf Tweets I had downloaded, I found many gems that had URLs in them and had been retweeted 10 times or more. (Here are the retrieval details.) Here are some of the examples of people participating in the R organism “computing on itself” at the Rstudio::conf 2018:

Import: Open, welcome. Importing data can be hard work, right? Welcoming new people and bringing in new ideas is often not recognized as the important and challenging work that it is. Mara Averick performs a valuable service on Twitter as @dataandme. Her talk at the conference about contributing to the tidyverse unpacked the process of entering the organism and becoming a contributor to it. Marco Blume talked about using DataCamp to develop R skills and data literacy in everyone at his company who wanted to skill up. But he was also bringing a new idea into the R organism’s conversation with itself: what does a social transformation around data literacy look like? What makes it happen? How do you assess progress?

Tidy: Configure. Making your data tidy can be a big undertaking. “Tidy” is more or less obvious when you get there, but can be very challenging at the beginning of the process. JJ Allaire’s keynote about deep learning and TensorFlow in R took a huge step toward making TensorFlow look like a native part of R. His talk included a set of tools, a gallery and a book. It represented a massive effort to bring a whole new domain into “normal R.” And in a remarkable “no bullshit” fashion, he mentioned a recent paper that casts an interesting shadow of doubt over every data scientist’s valiant effort to “clean the data.” The paper, “Scalable and accurate deep learning for electronic health records,” suggests that at large scale “dirty but complete” data may have more useful predictive value than data that has been “reduced by intelligent cleaning”.

Transform: Reinterpret. Di Cook explored how to take the “traditional, ubiquitous tools” like the tidyverse and ggplot2 and connect them with another set of ubiquitous tools: randomization and replication. She argued that graphs are just the result of calculations on data and so evaluating that output requires rigorous methods and more calculation. Here are her slides. And you can use and build on a package (that she maintains) to do it yourself. As if to remind us of the iterative nature of transformation & reinterpretation, Carson Sievert discussed graphs for exploration using a JavaScript library and referring to his book and R package that is more or less at the opposite end of the spectrum from Di Cook’s talk.

Visualize: Contextualize, metabolize. Jenny Bryan’s workshop on “What They Forgot to Teach You About R” was really about helping you visualize how you work and think through how to make your work processes more orderly and rational. Rethinking a common tool like Github and adapting it to a data analysis use case is a work in progress. Jessica Minnier’s workshop notes were great! I was really surprised to notice how much I just “jump outside of R / Rstudio” to do stuff in the course of a project and how that makes my work so much less reproducible. Of course there’s a book about the GitHub part of the data analyst’s work flow.

Model: Simplify, standardize. Modeling, to use John Tukey’s words, is separating a fit from the residuals. The R community makes a great effort to name things well and (particularly in the Tidyverse) to have some consistency in package APIs. Nicholas Tierney’s ePoster on his naniar package to profile missing data is a perfect example of taking a messy subject and wrapping it up in a neat package with a neat name and a good joke about Narnia thrown in. Jim Hester’s talk “You can make your own package in 20 minutes” simplifies the process down to the bare minimum so that we can standardize code and are free to go through the Transform, Visualize, Model cycle again.

Communicate: Share, continue. The emphasis on that “last mile” of communication and sharing results is something remarkable about the R organism. It has always given me great confidence and it was much in evidence at the RStudio Conference. One great example of that was Yihui Xie’s “Creating Websites with R Markdown and blogdown”. Of course having enough extra compute power to crack a good joke every other slide is also personally inspiring to me. Petr Simecek’s collection of all the conference slides organizes the whole thing and sets the ground for the next cycle of computing on R. I already got a ticket for next year’s conference.

Recommended reading: Cesar Hidalgo. Why Information Grows: The Evolution of Order, from Atoms to Economies. Basic Books, 2015. http://isbn.nu/9780465048991


Jun 12 2017

Feedback from the Cascadia R Conference participants

Published by John David Smith under community data,Evaluation,Event design,pdxdata,R

I helped organize the Cascadia R Conference on June 3, 2017.

About 190 people attended the all-day conference. I volunteered to do the conference evaluation questionnaire and to analyze the results. We adapted a questionnaire that csv,conf,v3 had used and set up a Google Form to gather feedback at the end of the day. We got 59 responses or 32% of the conference participants, which seemed like an encouraging response rate.

The week after the conference, the organizing committee (which included Chester Ismay, Jessica Minnier, Lilly Winfree, Oliver Keyes, Scott Chamberlain, Ted Laderas, and me) chatted about the open-ended responses and discussed how things went in our Slack channel. That actually seemed like an excellent, informal sense-making strategy for when an entire committee is data-oriented. This post combines some of their reflections (without attributing specific comments or contributions to them) with a more systematic summarization of the data that I’ve done afterward. Naturally I had to horse around with the response data in R and produce some graphs to depict what I thought was important. Ted Laderas wrote up his reflections on the whole project in another post.

In scheduling a full day’s sessions – on a Saturday – one of our goals was to bridge across communities, specifically geographic communities (north and south along I-5). We succeeded beyond our expectations. About a third of the respondents came from more than 50 miles away with a bunch of people from more than 100 miles away.

Respondents asked for more social time: we know but probably always need to be reminded that R users are very social. (That’s an essential ingredient of R’s secret sauce.) Definitely the day’s schedule was action packed, which we thought was a good thing. But, as one participant said, “The lightning talks were kind of rushed”, as they were supposed to be. When asked what their favorite thing was about the conference, 22 participants said “Workshops!”:

Workshops – 22
Lightning talks – 10
Meeting people – 7
Keynotes – 7

Two days might provide more social breathing room (for everyone except the organizers). However a two-day format might be a challenge for people driving to Portland from far away. In the future we could consider having a full “pre-conference” day for workshops and a separate day for talks. (We didn’t really know how popular the workshops would be.) The useR conference uses the “pre-conference” structure for workshops. They had 15 minutes for talks in 5 different locations with 3 minutes in between to transfer between talks so our setup for the talks was in line with their practice.

In the future we should just say upfront that this is not a traditional conference. “To keep the cost of registration low we sacrifice <this> and <this> and <this>.” And one of those is that you are on your own for housing. A longer conference might tempt us to try to deal with an “official conference hotel”, but that would probably become a big headache.

Mimicking other conference formats and organizations like useR or the csv,conf is an important strategy for a small group like ours. Afterward we noticed that the Open Source Bridge has the volunteer thing figured out, for example. Some of us are going to that to pick up tips and strategies.

Here are the histograms from the new R skimr package, which suggest that everybody thought the location was great, a few people had problems with the Wi-Fi, and opinions of the keynote topics and overall talk quality were very (but not completely) positive.

With really outstanding support from OHSU, rOpenSci, and Rstudio we did pretty well.

Interesting that we had a good mix of R mastery — from beginners to masters. People who identified themselves as 3’s or 4’s were unusually enthusiastic about the keynote topics.

Respondents who said they thought participating in the conference would lead to future collaborations were more likely to say they would be willing to help organize (or volunteer) or that maybe they would be. The cross-tabulation is below. This pattern is clearer when you look at a Tukey median polish with residuals in italics and the fit in bold.

		Willing to help organize?
	Raw data	Yes	Maybe	No
Met people I'm likely to collaborate with	Yes	14	5	3
	Maybe	5	14	12
	No	1	2	2
	Median polish				Row fit
	Yes	10	0	-2	0
	Maybe	-6	2	0	7
	No	0	0	0	-3
	Column fit	-1	0	0	5

Here is the code for fitting a median polish:

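# Cross-tabulate the two survey questions (one row per respondent in `feedback`)
xtab <- xtabs(~ will_collab + help_org, data = feedback)
# Fit Tukey's median polish: overall, row, and column effects plus residuals
medpolish(xtab)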
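
For anyone who wants to reproduce the table without the survey file, here is a self-contained sketch that types in the raw counts from the cross-tabulation above and runs the same medpolish() function on them. The object names (counts, fit, rebuilt) are mine and purely illustrative. The last two lines check that the polish is an additive decomposition: overall + row effect + column effect + residual gives back each raw count.

```r
# Counts copied from the "Raw data" rows of the table in this post.
# Rows: "Met people I'm likely to collaborate with"
# Columns: "Willing to help organize?"
counts <- matrix(
  c(14,  5,  3,
     5, 14, 12,
     1,  2,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(collaborate = c("Yes", "Maybe", "No"),
                  organize    = c("Yes", "Maybe", "No"))
)

# trace.iter = FALSE just silences the per-iteration progress messages.
fit <- medpolish(counts, trace.iter = FALSE)

fit$overall    # common value shared by every cell
fit$row        # row effects ("Row fit" in the table)
fit$col        # column effects ("Column fit" in the table)
fit$residuals  # what the additive fit leaves unexplained

# The pieces add back up to the raw counts.
rebuilt <- fit$overall + outer(fit$row, fit$col, "+") + fit$residuals
all.equal(unname(rebuilt), unname(counts))
```

With these counts the fit should come back with an overall value of 5, row effects of 0, 7 and -3, and column effects of -1, 0 and 0, which is what the table shows. The big residual of 10 in the Yes/Yes cell is the pattern described in the text.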

Also, respondents who said they were early or in the middle of their careers more frequently said they would be (or maybe would be) willing to volunteer or help organize a conference in the future.

Some selected comments from the open-ended responses:

Amazing conference! love the one day thing, the schedule, and the location!
I think the cheap cost and central location really contributed to the crowd size.
It would be great to have a keynote about ethics in data science.

Final suggestions seemed to be all over the place:

Less talks; more interactive activities
Longer
Longer breaks
More lightning talks, fewer full talks!
More talks and less workshops
Fewer, more in-depth talks
Explicit tracks + more than 2 tracks, maybe? “R in Bio” / “Stat Computing” / ???
Have three workshop levels next time – total beginner, intermediate, and advanced.
Notify the food carts ahead of time; they were really unprepared to be so slammed, and I think giving them a heads-up would’ve helped everyone

Finally, here are the questions we asked respondents:

Satisfaction with Registration
Satisfaction with Location/Meeting space
Satisfaction with Timing: Please rate the overall distribution of talk schedules and unstructured time
Satisfaction with Keynote topics
Satisfaction with Talk quality
Satisfaction with Conference agenda overall
Satisfaction with Quality of the wifi/internet access
Satisfaction with Snacks and drinks
Please add additional comments on the overall conference organization.
Where are you in your career?
On your way to R mastery, where are you now?
How many miles did you travel to get here?
Do you think you will form new collaborations as a result of attending?
Overall, the conference _________.
I considered the conference a _________.
I would like this conference to be held _________.
Are you able to help organize or volunteer next time?
If you’re able to help next time, please add your email.
What were your favorite parts of the conference?
What would you change or improve about the conference?
Please provide any additional comments below.

Being able to participate in the whole process, from hatching the idea to helping it actually happen, to participating in the whole conference and finally to thinking about what worked and what could be improved afterwards in this blog post was a great experience. I’d happily do it again!

I’m even more convinced, if I needed to be, that the sociability and community-orientation that’s baked into R is profoundly important. When we are struggling with a bit of code, gnarly data, or a graph that doesn’t quite look right, we tend to think of ourselves as working alone. R provides many reminders that we are not alone and that data analysis is a social act.


Feb 06 2017

Dogged pursuit of data quality and use

Published by John David Smith under pdxdata

Data Dogs notes – Jan 2, 2017 – https://www.meetup.com/Portland-Data-Science-Group/events/236078570/

Challenge question: describe three measurements that you are familiar with at each of these three levels 1) personal, 2) very large scale, and 3) in between. For each level describe what’s measured, why, how, and who does the measurement. How does it all add up to good or bad measures?

Group 1 report-out

Personal examples:

Observing the health of a great aunt
Health data such as using a fitbit and syncing it with a phone
Tracking blood donations, keeping track of which arm is used. Noticing how being lazy about collection enters into it.

Large scale:

Google in general.
Noticing how adverts follow us around. Avoiding them by using a VPN or going incognito. All of it makes paranoia more admirable.

In between:

Have a customer that generates lots of data: so much that paper doesn’t work anymore. Trying an electronic version. Find that clients collect data but don’t analyze it, don’t want to hear conclusions. They really want to hear “their story” reflected.
Similar to “egoless programming,” it’s a challenge to try to disassociate yourself from the product so that you can look at the data.
Example from the Kim Kaners paper: not collecting metrics because data can be used against you.
Quote from a client: “You can’t trust the data because things have improved [since it was collected].” If the business is going well, maybe you don’t need to improve?
Which is most frustrating: getting clear, clean data or getting clients to listen or use the data?
Like Web Server statistics. You can control what’s on the website and how to advertise. You can control what’s to be measured and where to allow advertisements.

Group 2 report out

A difficult topic: “Usually people have some role in measurement. So psychology always has an impact. Feelings matter in measurement.”
Another topic: accuracy vs precision. Measure something vs estimate. Is every measurement a bit of an estimate?
Gripe about data we depend on but where we can’t influence measurement methods: people don’t reveal their methodology, even though it’s important.
Spent time on last: formalized process, change methods, be more precise, summarize. Subjective data. Tracking assessments of hires. Categorical data. Journal entries. Different people using different codes. Severity codes. What does a 4 severity mean?
When you formalize the process, it seems like you spend a lot of time on the measurement.
GitPrime.com: “Data-driven visibility for modern software teams.” A new githost company that sells you statistics about your developers. Measuring programming productivity. “Software teams have never had a shred of hard data to bring into discussions. Forward-thinking organizations are moving beyond subjective measures like tickets-cleared, and evaluating their work patterns based on concrete data. Knowing immediately when an engineer is stuck or spinning their wheels helps managers do their jobs in a way that has never been possible. Measuring things like churn, codebase impact, and true cost of paying down technical debt, allows engineering to demonstrate how they’re meeting the business objectives at hand.” cdn2.hubspot.net/hubfs/2494207/Content/whitepaper/GP-DataDrivenTeams.pdf
Goodhart’s law: “when a measure becomes a target, it ceases to be a good measure.”
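
The accuracy vs precision distinction from this group's notes can be made concrete with a small Python sketch. All of the numbers are made up for illustration: two hypothetical bathroom scales weighing a person whose "true" weight we pretend to know. Accuracy is taken here as the distance of the average reading from the truth (bias), and precision as the scatter between repeated readings.

```python
import statistics

def describe(readings, true_value):
    """Return (bias, spread) for a list of repeated measurements."""
    # Accuracy: how far the average reading sits from the true value.
    bias = statistics.mean(readings) - true_value
    # Precision: how much the repeated readings scatter around their own mean.
    spread = statistics.stdev(readings)
    return bias, spread

TRUE_WEIGHT = 70.0  # hypothetical "true" value in kg

# Hypothetical scale A: readings agree closely but sit about 2 kg high.
scale_a = [72.0, 72.1, 71.9, 72.0, 72.0]
# Hypothetical scale B: readings centre on the truth but scatter widely.
scale_b = [68.0, 72.0, 70.5, 69.0, 70.5]

bias_a, spread_a = describe(scale_a, TRUE_WEIGHT)
bias_b, spread_b = describe(scale_b, TRUE_WEIGHT)

print(f"Scale A: bias {bias_a:+.2f} kg, spread {spread_a:.2f} kg")
print(f"Scale B: bias {bias_b:+.2f} kg, spread {spread_b:.2f} kg")
```

Scale A is precise but not accurate; scale B is accurate on average but not precise. In real life the true value is usually unknown, which is one reason every measurement can feel like a bit of an estimate.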

Group 3 report out

Personal:

Body weight is a big one.
Getting things done. Todo lists. Tracked or not.

Large scale:

Statistics around gun violence. How to get the data and understand it. Measurement problems: there are laws against measurement! Scraping the net for data. Washington Post efforts to collect data about gun violence.
Data about electronic components not shared.

In Between:

Own data science project: sampling size. Counting criteria. What measuring affects accuracy, calculations. More samples means more accuracy but also more time gathering data. Facing the tradeoffs gathering cell counts & brain samples. Counting standards. A project on dementia. Looking through a microscope. Cells picked out. Candy turtles.
Trying to get out of manually measuring (or manually entering data about) anything, mechanising all data collection. Practices vary by industry. In the electronics industry the unit is “defects per million”. Different in a new ERP system.
Economic incentive to do as little manual entry as possible: it takes time and money.
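
The tradeoff mentioned above (more samples means more accuracy but also more time gathering data) can be sketched with the textbook formula for the standard error of a mean, which shrinks with the square root of the sample size. This is only an illustration: the standard deviation below is a made-up number, and the formula assumes independent samples. It also says nothing about bias in how cells are picked out or counted.

```python
import math

def standard_error(sample_sd, n):
    """Standard error of the mean for n independent samples."""
    return sample_sd / math.sqrt(n)

SAMPLE_SD = 12.0  # hypothetical spread of cell counts per sample

# Each step collects four times as many samples as the one before.
for n in (25, 100, 400):
    print(f"n = {n:3d}  standard error = {standard_error(SAMPLE_SD, n):.2f}")
```

Under these assumptions, quadrupling the counting effort only halves the uncertainty, which is why the time cost of manual counting grows faster than the payoff.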

Group 4 report out

Personal:

Measuring the amount of time spent executing a job during the day. Getting at “what’s personal efficiency?” How measure it? Difficulties and vagaries of measurement. Have to remember to measure it throughout the day. Bottom line: billing clients with it.
Anecdotal (“I had a bad Tuesday”) vs. data-driven decisions.
When business doubled, who is wondering about capacity in the factory?

Variability:

How businesses handle item/master data is frequently messy. People use a field in the system for very different purposes than what was intended or for what others use it. More communication can lead to more standardization. What a field “means” and how you are supposed to use it. Variant uses of a field can also be clever.

An example of data variability reduction: Subway, Inc. studied supplier data provided by its franchisees. They found that 80% of supplier data was inaccurate across the whole supplier chain. Now they get purchase amount data from the suppliers and push it to the individual franchisees to standardize reporting about franchise performance.

We had a general discussion about assessing causation when direct experimentation is impossible using the “Hill Criteria” that Roger Peng mentioned in a recent podcast: https://en.wikipedia.org/wiki/Bradford_Hill_criteria

Thomas E. Kida, Don’t Believe Everything You Think: The 6 Basic Mistakes We Make in Thinking

We prefer stories to statistics
We seek to confirm
We rarely appreciate the role of chance and coincidence in life
We can misperceive our world
We oversimplify
We have faulty memories


Dec 29 2016

Download our book for free

Published by John David Smith under Books,Digital Habitats

It is hard to believe we finished “Digital Habitats: Stewarding technology for communities” back in 2009 (after working on it off and on for 5 years). Seven years later the book is still selling on Amazon. And we hear that people still actually use it, even though the whole technology ecology has changed so much.

As of December 2016, we have decided to make the full publication available FREE! So now you have the following electronic options. (But feel free to keep buying too!)

Electronic versions are available in Zipped files here:
- For Amazon’s Kindle reader (.mobi)
- For Apple’s iBooks reader (.epub)
The book’s Digital_Habitats_TOC in PDF form, the full index and a glimpse into some sections of the book to help you get a better idea of what it is about.
Worksheet and tools that are related to the book : Digital Habitats Index, Community orientation Spidergram tool, Diagrams, Action Notebook


Sep 13 2016

Community – organization difference

Published by John David Smith under Communities of practice,SNA,talks

Since I did a talk at CHIFOO last May, I continue to think about situations where a community and an organization coincide or are closely intertwined. In comparison with how community and organization interact in other circumstances, it’s the churches and meditation groups depicted on the right that highlight the differences between the two regimes most clearly. These different settings have been on my mind for years.

But I think that the coexistence of organizational and community regimes is the norm, even though the contrast is clearer in some circumstances than in others. Here’s a table that’s evolving to describe how I think of the difference between the two regimes:

	Organizational logic	Community logic
Rests upon	Civic or economic system	Social fabric
How roles are set	Contracted / appointed	Negotiated / evolved / reputed
Surrounding membrane	Coase’s Theorem	Legitimate peripheral participation
Kind of entity	A legal person	A tradition
How mapped	Hierarchical org chart? Chart of accounts?	Stories, texts, places, events
Value created	Adding to a value chain, delivering a product	Reifying community, practice, domain
Communication structure	Go through channels	Many-to-many, some-to-some
Time investment	Depends on mandate	Depends on perceived value

What seems tricky to me about thinking of these two in comparison to each other is that we constantly switch from one regime to the other — usually without thought or awareness.

2 responses so far

Aug 25 2016

Tools for a Hack Oregon project

Published by John David Smith under KM,Technology,technology_stewardship

A Hack Oregon project is fun and you get to make a contribution. In a project like the Oregon Hunger Equation last spring, everyone is figuring out both how to have fun and collaborate during the length of the project at the same time. The fun part means you get to improvise and invent, hang out with a bunch of cool people, figure out how your skills fit in and learn a lot. The contribution part means you need to focus and collaborate effectively, figure out how to make what you’ve got to offer a value-add for the group and for people who use the final product — beyond the demo.

Here are some suggestions for figuring all of that out, mixing experience in a Hack Oregon project in the Spring of 2016 with technology stewardship ideas like those I wrote about in “Technology Stewardship for Distributed Project Teams” (a book chapter from a few years ago that’s more theoretical). This post is my attempt to make sense of what went on in one project and to prepare for the next one.

Some “think abouts”

Nothing beats testing what works on the spot. Forget about these suggestions if they don’t work for you or your team (even if they are based on plenty of previous experience).
Working together face-to-face or side-by-side is obviously the best way to communicate. But all the other tools help make the face-to-face meetings work better because the results and the connections don’t end when everybody goes home.
Like it or not, you’ll end up using a bunch of tools, some of which may be unfamiliar or unappealing to you. This post tries to help you figure out how to use them together and which one is the best to use for which purpose.
Hackathons are drop-in games: who shows up is who shows up. People can’t make it to all the meetings and often enough they have to arrive late or leave early. Think of how to help them get back on board with the project. Ask them to reciprocate.
People have very different styles of making sense of what’s been said or done face-to-face or via the tools the group uses. And they remember it differently. Figuring out how to communicate on a team is productive and can be fun, too.
Lots of different people make lots of partly-organized efforts to deal with hunger in Oregon.
There are other teams working on the same or related projects. (Like you, next year.) Think of how you can make your work visible and useful to those people. Stories are the universal way of communicating with people across time and space.
Figure out what other people know and how you can help them & how they can help you. This can’t be scheduled very well in advance, but it does happen.
All the tools that the team uses need to be connected to each other. Usually this doesn’t happen automatically; you need to post the Google Folder URL in Slack, for example.

Face-to-face meetings

Here are some suggestions for making the face-to-face meetings a solid basis for the other tools & ways of communicating.

Think about how to make the meetings work for you by rearranging the seating or the work tables if necessary.
Make it a regular thing at the beginning of each meeting to review what happened at the last meeting (or what’s happened in between). People will remember different things.
Stating a session goal is useful sometimes. Agendas are your friend.
Latecomers and early-leave-takers are a reality. How can you make it easier for them?
When a meeting turns out to be mainly independent work sessions, it helps to spend a few minutes announcing what you’re up to at the beginning (and maybe at the end).
Make friends. Remember that everybody feels like they are out of the loop.
Speak up when you don’t know what people are talking about. An acronym, a work step, a source, etc., etc. These are very complicated projects. Nobody can know it all. There’s always someone who feels completely out of it at a meeting. How can you bring them in?
It’s good to end a meeting with a review of what people need or are working on between meetings.

People directory

Starting with a simple directory of who’s working on the project, think of how you can help people connect with each other:

People have different handles (usernames) on different platforms — which faces, names, nicknames, email addresses, handles, etc., go together is not obvious to someone else that needs to get in touch with you.
Someone needs to set up a directory that everyone can find and use from early on in the project. People need to enter their own information for maximum accuracy and choose what they are comfortable sharing.
Suggested information for a people directory:
- Name
- Email address
- Google ID (for edit access, etc.)
- Phone
- Slack, Twitter, Github, and other handles
- Skills or roles? Background? Think of how other people can use or will interpret what you share about yourself.
Figure out how to make the directory an easy reference for everyone — especially people coming on board.
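
For teams that want more than a shared spreadsheet, the directory idea above can be sketched as a tiny lookup table in Python. Everything here is hypothetical: the `Person` fields, the sample entry, and the helper names are illustrations, not something the project actually used. The point is simply that one record per person, indexed by every handle they use, answers the question "who is this?" on any platform.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    email: str
    # Maps a platform name (e.g. "slack", "github") to the handle used there.
    handles: dict = field(default_factory=dict)

def build_handle_index(people):
    """Index every email address and handle back to its Person record."""
    index = {}
    for person in people:
        index[person.email.lower()] = person
        for handle in person.handles.values():
            index[handle.lower().lstrip("@")] = person
    return index

def who_is(index, handle):
    """Look up a person by any handle; returns None if unknown."""
    return index.get(handle.lower().lstrip("@"))

# Hypothetical entry, filled in by the person themselves.
people = [
    Person(
        name="Sam Example",
        email="sam@example.org",
        handles={"slack": "sam", "github": "skdmknd13", "twitter": "@skdmknd13"},
    ),
]

index = build_handle_index(people)
print(who_is(index, "@skdmknd13").name)
```

A Google Spreadsheet with the same columns does the same job for most teams; the sketch only shows why it helps to record every handle in one place.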

Directory of project resources

As your project grows in complexity, there are more resources that need to be coordinated. Some people will just carry all of it in their head, but it makes a big difference to most people if there’s one document that has all the essential details.

Other projects or external resources might provide context that people on your project will need sooner or later. Nobody knows all of the context, so it’s important to share.
A Directory of project resources could include:
- Project goal statement
- Slack team space URL
- Github repository URL
- Google Docs folder URL
- Meeting location & time
- People directory URL
- A chain of custody log for the data
- A customized version of this document?
Remember that project resources evolve over time, so somebody needs to update the resource directory as you go.

Slack

Slack is for near-real time communication. You can think of it as a chatty project log with a long memory.
It runs on your phone, desktop, and the web.
Slack makes it easy to post and announce stuff you have to share. It’s not such a good place for stowing resources, even though it’s easy to post stuff there. It’s best for sharing pointers to resources.
On Slack, your unique identifier is your email address, not your Slack handle, which you can change whenever you want. It helps if your Slack handle is a realistic name, not an obscure handle that’s unique, say, on Twitter or Github. For example, “skdmknd13” might be unique and something you identify with, but it’s not so great for conversation. Think how odd it sounds when someone says, “Hey @skdmknd13 that was a great job!”
It’s easy to set up different “channels” and they can be useful to separate different parts of a project or threads of conversation, but it’s also easy to have too many of them.
You may need to invite people joining a project to a specific channel. It’s nice to welcome latecomers, too. It’s very useful to give new people feedback: “I see you; I see what you posted; it makes sense. Thanks.”
Some people use Slack at work, others don’t. Response time varies according to whether people appear to be logged on or not. Notifications by email are easy to set up and manage.

Google Docs

Google Docs is the ideal environment for storing shared text like meeting notes, references, documents, diagrams, and project rosters. Google Spreadsheets are an effective way to keep small datasets handy.

Some people use Google Docs a lot and take them for granted, while others are completely mystified. Figuring out that once they open a folder, they can add it to their own “Google Drive” turns out to be a big step in becoming comfortable with Google Docs. Looking over each other’s shoulders to see how people use Google Docs can reveal tricks you didn’t know about or things that other people need help with.

If a few people can get into the practice of taking notes during the beginning of a meeting, in discussions, or at the end, it will go a long way to recording important contributions and resources. Having more than one person keep an eye on the meeting notes helps a lot so you can take turns, supply extra detail, and correct each other’s spelling.
A big drawback of Google Docs is that you can only search through the titles, not the contents of a document unless you open each document. Therefore naming documents consistently and putting them into folders is really important.

Github

Github is important sooner or later since version control is essential in a distributed, collaborative software development project.
Github can be mysterious for people who don’t use it regularly and the learning curve can seem a bit steep, despite good books that are free like Pro Git. A little orientation at the right time helps, since people have very different skills re Github. Without a bit of educational effort there’s a “throwing it over the transom effect” in the sense that once project activity focuses on code development, some people step back too far and ignore what’s going on.
Github can be a really good way to publish documentation written in Markdown, like this piece on “How to share data with a statistician.” It’s not so good for real-time updates, but it works well when you want to leave something visible for posterity.
Like any tool, Github can be clumsy when used for the wrong purpose. It’s not really a good data repository, even though it can store and display csv files. Small datasets are better stored as a Google Spreadsheet because they can be easily sorted or filtered and readily downloaded to a csv format.
Depending on the size and longevity of a project, issue tracking in Github can be a useful way to track bugs and manage fixes. Using Waffle IO to visualize issues à la Trello can make issue tracking more accessible: https://waffle.io/hackoregon/Hunger
A good project README.md is important.

Email

Email is the universal fall-back communication technology (after face-to-face).
Some people are sensitive about sharing their email address.
Good for one-to-one contacts and for blasts to the team or the whole world.
If people have their Slack account set up correctly, Slack will send them notifications by email.

Photos

Photos are good souvenirs and great for showing people how much fun it is — for next time.

Thanks!

Emily Logan, Aaron DeVore, and the whole project team contributed in one way or another to this piece.

4 responses so far

Mar 28 2016

Schmoozing with Portland Data Scientists

Published by John David Smith under Communities of practice,community data,R,technology_stewardship

Here are the topics that Portland R Users say they are interested in:

I’m interested in those topics, too. And the several other data science MeetUps have similar topic profiles. But when people ask to join the Portland Data Science MeetUp, for example, they say they are seeking things like:

Networking.
Meet people with similar interests.
Better sense of the Portland data science community.
Meet more people in the community, and learn what types of data work goes on in Portland.
Meet other people with similar interests.
Meet colleagues and hear about their best practices, projects and approaches to solving problems in the data science space.

That’s what I want, too. But most of the Portland data science MeetUps seem to consist of sitting in front of a speaker who’s in front of a screen talking to a group of people who are looking at their computers. Not that much chatting with the person sitting next to you. How can a local, mainly face-to-face group find a useful function in a larger learning ecosystem that includes (for R, at least) Twitter, Stack Overflow, R-Bloggers, various mailing lists, etc., etc.?

Some of the event interaction behavior that I’m seeing is venue-related, where the room layout and seating encourages limited cross-talk and mostly passive participation. But what the MeetUp platform itself provides is somewhat lacking as a community platform. It has some opportunities for discussion and interaction online, but postings from members seem to be mainly about what an informative and interesting presentation the last session was.

Stepping back to look at Portland’s Meetup scene more broadly (all my R code for data retrieval is on Github and a vignette is here) shows that there are lots of them and they come in all flavors.

By far the biggest groups are in the “Outdoors”, “Social” and “Singles” categories. “Support”, “Moms and Dads”, and “Fashion” are the smallest groups. Obviously most of those groups are not sitting looking at computer screens when they meet. But as a whole MeetUp groups make for a fascinating community laboratory. It makes me wonder what reasons are there for a group to grow very large or for it to stay small and differentiate from other similar groups? Here’s a look at 5 groups in the data science area as they’ve grown over time:

The Python MeetUp group is big for several reasons:

it’s the oldest,
the language is used for data science purposes as well as for programming more generally,
it has a mix of large and small meetings (based on the number of RSVPs; R Users and the data science groups have a similar mix),
it has had regular meetings with consistently large RSVP numbers,
it has had no interruptions (like those of the R group)

What’s going on here? I’ve found that going out for a drink afterward is exactly where networking (and a lot of the learning) goes on. To find out more, I got involved with a small group that was working to bring the several data science MeetUps closer together, since there is a lot of overlap in the topics they cover. We’ve met in bars and coffee shops to talk about a federation of MeetUps. Of course during our meetings everyone had to stare at the computer (including me, but my community background compelled me to step back and take a photo of the group). In the photo most everyone is looking at a Google Doc where we are writing a collective document about how to move our several MeetUps forward individually and together.

One strategy that we came up with was to set up a Slack Team room where we would expect more chatting could take place, even during a meeting. However, to create a way for MeetUp Group members to join a Slack team space involved two other platforms: Google Docs to do our planning and Github to create a common website for the federation of MeetUp Groups.

Here is a re-cap of the functions and issues that I see in the use of these four platforms.

Meetup.com is oriented toward “getting together”. It has good group discovery, an easy way to affiliate with (or join) a group, good meeting notification, a nice way for members to link to their Twitter and LinkedIn pages, an RSVP function that allows for meeting organizers to deal with smaller venues, some linking with other members, and a funny “attendance” function (where you click on a “good to see you” link, in effect indicating who actually showed up at a meeting). It has some features that limit a community’s interactions, including an orientation toward “the next event” rather than a topic orientation (i.e., “what we know” or “what we have learned”). MeetUp has a limit on the number of characters in a comment, so meeting notes can’t be very long at all. It also shields member identity carefully by making it difficult to share your email address through its individual messaging channel; in effect it tries to keep you tied to Meetup for member-to-member communication.

We decided we wanted to add Slack.com as a data science federation platform because it’s oriented toward “being together” (or at least “hanging together”, or at least “chatting together”). It makes it easy to have multiple chat channels, has good (easy to control) notifications, makes it easy to drag & drop documents and files into a channel, has excellent search within a team space, feels “private”, and supports closed groups within a larger team structure. It also works nicely on a smart phone as well as on the web. On the other hand, a Slack team room is not discoverable via a search, and it requires users to be “invited,” which could have become a labor-intensive job for a loose group like us.

We found ourselves using Google Docs to discuss and plan how to “federate” the several data science MeetUps, because Google Docs are oriented toward “thinking together”. Being able to share documents, control who can write to a document, and have multiple people write in a document are all very useful functions. Although Google Docs work well for a small leadership group, they aren’t so effective for communication within a very large group, partly because of the very document structure.

Although github.com is basically a “coding together” platform, it also turns out to be a very social platform. Github pages was the easiest way to set up our data science federation website: http://pdxdata.org/. We were able to borrow a trick from The Ann Arbor User Group for automating the Slack Team invitation process. Github is quite social for its limited technical user base (a STAT 545 class at UBC even uses it for class discussion).


Sep 24 2015

R and Google Spreadsheets (and the context)

Published by John David Smith under code,R,technology_stewardship

R and Google spreadsheets are powerful partners for verification, exploration and delivery of data to people in communities and organizations. Recently I wrote a convenience wrapper around the gs_upload function in Jenny Bryan’s wonderful googlesheets package. But it takes some context to explain how I use it the way I do and why I think it’s important, so this post has a little bit of R code and a lot of context. It expands on a Lightning Talk I gave at the Portland R User Group last Wednesday.

In general I’ve found that getting tables or data frames from R into a document is fairly clumsy. There are several packages available that are designed for the purpose, but to me they seem like more trouble than they are worth. Often they are easy on the R side but working with the output on the word processing side ends up being laborious and clumsy. However, once a data frame is uploaded to a Google spreadsheet, cutting and pasting some or all of it into a document is very easy, fast, and it doesn’t add junk that you have to remove later on. Apart from that, a Google spreadsheet is sometimes even more handy than RStudio’s data viewer (which is saying a lot) for inspecting and pondering what’s actually going on in a data frame.

But the real reason for the small bit of R code that I am sharing is that it helps with the verification, exploration, delivery, and use of data in a social context. A Google spreadsheet makes it easy to control access to your data by sending someone a URL and simply giving them access to it. It’s very easy for two people to look at it together, say, on a phone call. One or more people can annotate, sort, filter, plot, or pivot the data in it to their heart’s content (depending on their level of skill). If the conversation involves annotating it, a Google spreadsheet is nice because you know that everyone is looking at the same version. In addition, it actually works as a delivery method in that the data can then be downloaded into whatever environment someone wants.

The context of data and of analysis

But back to the subject of context. Back in the 1970s I read a lot of books and articles by statistical visionary John Tukey. (My copy of EDA is one of my most thumbed, battered and annotated books.) I can’t find the exact quote, but I’m sure that somewhere Tukey said that “Domain knowledge is essential for data analysis.” I found the statement to be particularly troubling because I didn’t think I had any (certainly not enough) domain knowledge. I wanted techniques for data analysis that would give me insight in domains where I was pretty sure I was clueless. I wanted to be able to analyze data without knowing much about the context. Disappointingly, it turns out that, despite what you might think judging by most of the discussion about them, statistical techniques and software are no context-free silver bullet! In retrospect I was learning to appreciate the importance of domain knowledge when in those years I would always take a sick day to read an entire SUGI proceedings the day after it arrived, in part to check out what contexts framed the work of other SAS users.

As I think about that issue decades later, it seems to me that domain knowledge is indeed involved in at least these facets of sense-making with data:

Data collection – where and how the data was collected and what it means
Coding – how it was recorded and coded (and why and how to decode it)
Retrieval – what kinds of ethical and technical issues there are around getting and using the data
Munging – the pain and pleasure of cleaning, combining, and organizing the data for use
Exploration – having some intuition as to where to look and how to look at it
Analysis – understanding what kind of data reduction or analysis is relevant or customary
Value assessment – judging the value of the data and the results of an analytical effort
Communication for action – communicating the results to people who can take action

These contextual facets (I just made up or unconsciously stole this list, and would love to hear about your list) seem very important to me. Part of my motivation here is to argue that we need to bring more context into R Group discussions and presentations. In Why Information Grows: The Evolution of Order, from Atoms to Economies, César Hidalgo (2015) points to this same issue:

It is hard for us humans to separate information from meaning because we cannot help interpreting messages. We infuse messages with meaning automatically, fooling ourselves to believe that the meaning of a message is carried in the message. But it is not. This is only an illusion. Meaning is derived from context and prior knowledge.

Although we manipulate information with R, what we care about and what we seek are the messages that it encodes. Therefore we always need to be aware of context, drawing on whatever domain knowledge we have that is, consciously or not, turning our information into messages.

Communities and their technologies shape data and message

Slow forward 35 or 40 years and I do have some domain knowledge about stewarding technology for communities. How communities, organizations, and technology all interact has been a long term interest: their interaction is important and determines what data the community might have or need about itself. I think each item on my list of sense-making facets interacts with issues of community, organization, and technology.

In the following diagram from a blog post 2 years ago, the red lines represent organizational boundaries and the ochre lines suggest community boundaries. I’m still fascinated by the two examples on the far right of this diagram, where an organization (linked “personbytes” of knowledge, in Hidalgo’s terms) is present and plays a role but does not contain the community: CoP-org-configurations-annotated

Although the landscape changes constantly, combining technologies, understanding how they support memory practices, how they do or don’t work together, and how they support being and learning together are a big challenge for communities and for organizations, just as much as they were when we were writing Digital Habitats. Here are some examples of the data issues that pop up for free-standing communities that depend on technology for their existence but are not contained by an organization:

The Portland R Users group is just fine using Meetup for chit-chat, scheduling, and sharing resources. A nice meeting room like the one that Simple has provided is a key resource. And of course people’s willingness to give talks is what keeps it alive. Although members may be affiliated with an organization, the community’s “organization” is the determination of one individual who keeps the conversation going (and brings pizza!). Meetup’s clever “Good to see you!” follow-up emails accomplish an individual purpose (recognizing and greeting other people) at the same time that they gather data about the community: attendance and social network information. The data on a participant’s ratings of a session or who they greeted are available through an API and may be used by Meetup Inc., but it is not readily available to the community itself. The community’s “organization” is supplied by Meetup.
An open source project like OpenRefine relies on Github for its front door, to manage its code, its binary downloads, and its documentation wiki. An email list, a custom search engine covering community blogs and a Twitter hash tag complete the community’s basic technology infrastructure. Its “organization” basically consists of the list of contributors. Although that is preserved, imagine how much community history was lost as the tool transitioned from its original creators, Metaweb Technologies, Inc., when they were acquired by Google, which then spun it off as a community-supported product. Communities are often capable of holding long histories but it’s not automatic, nor necessarily supported by community infrastructure.
KM4Dev sprawls over an email list, a wiki, a Ning site, a G+ community, a hash tag, group meetings, Skype group meetings, and other platforms and venues that are unknowable unless you were there or heard about them from someone who was there. Its “organization” is constantly in question but it has managed to survive a long time and obtains funding to study itself and accepts donations even though it isn’t a formal entity. Each of its platforms has a different way to keep track of a member and her activity so data integration is very difficult. That means the community can’t use its data to argue to the big employers where its members work that KM4Dev is a key part of their professional infrastructure. It may be that its anti-organization stance, which is reflected in the loose coupling between its tools, is a response to the over-organization of large development bureaucracies.

These three free-standing communities are viable and productive with a minimum of organizational structure. Their data resources mostly serve their needs and are in alignment with community energy. However, as communities grow in size, complexity, ambition, or age, they need some kind of organization at their center (as depicted on the right in the diagram above). The churches, temples, and meditation centers that I’ve been studying and working with over the last 5 years all eventually need some kind of organization to carry out administrative functions on behalf of their communities. The question is always: how much “organization” is enough, or too much?

Collecting data and using it depends on the organized activity that typically happens in an organization. The arduous task of collecting complex data and using it for diverse purposes, across time, across platforms, and across diverse social contexts requires even more organization. But what is depicted, what the messages are about, is often community participation: the voluntary and more chaotic side of life that can’t be captured. The question today is how much data resource is enough, or too much?

To get at the messages in the information that one of these organizations keeps, we need to remember that the organization and the community jointly frame data gathering, storage, integration, use, and meaning. To understand data issues, we have to consider questions such as:

Balance between community and organization: are the ways that one serves the other effective and well-understood? are parts of the community or the organization more important (or out of reach)?
Life span and length of memory: how important is it to remember participation, adherence or contribution? how far back does history go? how much change is required to “stay the same”?
Social context and diversity: what locations, languages, or different purposes are represented? how consistently are messages encoded and decoded?
Technology dependency, diversity and integration: what parts of the organization or community’s life need to take place on technology owned by the organization itself versus platforms like Facebook or LinkedIn? how spread out over multiple technologies is the community and how important is integration?

These questions might sound ponderous if we’re just talking about one query or data project but I think they emerge when we do more with data resources in a community-related organization. We need to deal with all the traditional organizational issues as well as the kind of sense-making issues that communities are always engaged in.

From my R data frame to a distributed Google spreadsheet

Moving information from R to a Google spreadsheet is fairly straightforward. Taking care to transfer the messages requires some extra steps. For example, what’s a convenient, clear, and consistent name for an object in my R code is not necessarily helpful when delivered to someone else. Here are some changes I make to names as I upload an R data frame to a Google spreadsheet:

A terse data frame name becomes a longer and more descriptive spreadsheet name
Variable names are expanded to be more descriptive column headers
I never use capitals in variable names but I find that they make column headers easier to read
I replace underscores and dots in variable names with spaces, so that column headers consist of words that easily flow into more than one line
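The renaming steps above can be sketched in base R. This is only an illustration, not the code used for the post: the helper `make_header()`, the data frame `ctr_mbr`, its values, and the expansion table are all hypothetical, and the upload step itself is left out because it depends on which Google Sheets package is in use.

```r
# Hypothetical helper: turn terse R variable names into readable column headers.
# Assumes names are lowercase words separated by underscores or dots.
make_header <- function(var_names, expansions = character()) {
  # Swap in a longer description wherever one has been supplied
  expanded <- ifelse(var_names %in% names(expansions),
                     unname(expansions[var_names]),
                     var_names)
  # Underscores and dots become spaces so headers can wrap onto several lines
  spaced <- gsub("[_.]+", " ", expanded)
  # Capitalize the first letter of each header
  paste0(toupper(substring(spaced, 1, 1)), substring(spaced, 2))
}

# Made-up data frame with a terse name and terse variable names
ctr_mbr <- data.frame(
  center_name = c("Center A", "Center B"),
  n_members   = c(120, 85),
  avg.attend  = c(18.5, 22.25)
)

# Longer descriptions for the abbreviated variables (illustrative only)
expansions <- c(
  n_members  = "number of members",
  avg.attend = "average weekly attendance"
)

# The terse data frame name becomes a descriptive spreadsheet title
sheet_title <- "Center membership summary"

# Apply the readable headers just before uploading
names(ctr_mbr) <- make_header(names(ctr_mbr), expansions)
print(names(ctr_mbr))
```

Keeping the renaming in one function at the very end of a script means the analysis code can keep its short, consistent names while the people reading the spreadsheet see full words.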

Here is an example where I upload a small data frame to a Google spreadsheet.

Here’s a snapshot of the resulting Google spreadsheet:

Once the data frame is in a Google spreadsheet it’s helpful (and very easy) to:

Freeze the first row, so that column labels don’t disappear when you scroll down
Bold the first row, so it stands out clearly
Center and flow the text in the first row, so that the longer column header isn’t cut off
Set the width and formatting of each column appropriately (e.g., set decimal places)
Turn “filter” on to allow subsetting at the click of a button
Sometimes, specific rows or columns are set to a different color to call attention to a specific issue

Getting to the community’s message

So what domain knowledge is relevant to data about, by, and for a community with an organization at its center? Despite years accumulating domain knowledge about communities, organizations, and data analysis, there is a lot that I don’t know about the creation and use of the data I’m interested in. Working on behalf of some 250 centers of different sizes, nationalities, and levels of maturity around the globe means that even narrowing my focus to one database, there is not one context but many. On the data creation side, I find that there are different data entry practices and the volunteers who enter the data turn over regularly; learning about the data and its context is an ongoing process for new volunteers and therefore for me. Despite common intentions, many inconsistencies and blind spots aren’t visible until people can see the results of their work in a larger or comparative context, such as a handy Google spreadsheet.

When a data frame is a report that involves greater complexity than just a simple list, it requires additional explanation such as I suggest in the following example. Hints and suggestions expand on this column-by-column documentation:

Although some of the data frames I upload to Google spreadsheets are single-use, look-once, copy-once, and throw away, some of them are longer-lived. When R is joining information from many different sources (e.g., MySQL, Google Analytics, MailChimp, web scraping, etc.) or is replicating a report many times over, a complete description of the data and its context is worth the time.
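One way to keep column-by-column documentation next to a longer-lived report is a small "data dictionary" data frame that travels with it, for instance as a second worksheet. The sketch below is an assumption about how that might look, not the documentation format used in the post; the report, its columns, and the descriptions are made up.

```r
# Made-up report; column names and values are illustrative only
report <- data.frame(
  center  = c("Center A", "Center B"),
  members = c(120L, 85L),
  stringsAsFactors = FALSE
)

# One row of documentation per report column
data_dictionary <- data.frame(
  column      = names(report),
  type        = unname(vapply(report, function(x) class(x)[1], character(1))),
  description = c("Name of the center",
                  "Members recorded at the last update"),
  source      = c("MySQL", "MySQL"),
  stringsAsFactors = FALSE
)

# Guard: stop if a report column has been added without being documented
stopifnot(setequal(data_dictionary$column, names(report)))

print(data_dictionary)
```

Because the dictionary is itself a data frame, it can be uploaded alongside the report, where partners can correct or extend the descriptions.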

But nothing I write about the data is the last word. Eliciting knowledge from sense-making partners in a community and its organization is a key step in making the data resource useful. A Google spreadsheet seems like an ideal vehicle for negotiating and understanding the different assumptions and meanings that transform the information that I have into a meaningful message for my partners. I find that to them “information” is boring, but messages about people, processes and possibilities are interesting because they can lead to growth and benefit.
