Feb 12 2018
Computing on R
R is not just software. It’s a global organism that grows information: insights, discussions, work methods, human relationships, and open questions, as well as a massive amount of software and all the resources that document or support it. Cesar Hidalgo equates growing information with “computing”, claiming, “It is the growth of information that unifies the emergence of life with the growth of economies, and the emergence of complexity with the origins of wealth.”
It’s a big deal that R code can compute on R code, because R code is just as much a “first class thing” as the data we compute on. Both are first class objects, as Hadley Wickham points out in the rlang package documentation. But it’s an even bigger deal that the R organism computes on itself as well. Hidalgo explores how social structures process information: “We form social structures to compensate for our limited capacities, and these social structures learn how to process information.” My argument in this post is that Hadley’s model for data analysis describes both computation on data and computation on the R organism. Pointing to instances of social structure helps organize observations of how information grew at rstudio::conf in San Diego and how we are collectively learning to process information.
You can observe a lot just by gathering tweets. I didn’t create any persistent information during the conference; I just soaked it all in, waiting until I got on the plane home to write up some reflections. Wanting to add detail, I downloaded tweets with the #rstudioconf hashtag. In that batch, I found two cool efforts, both worth reading, to gather and analyze tweets produced during the conference:
- https://github.com/mkearney/rstudioconf_tweets (a comprehensive how-to post covering sentiment analysis, network structure, and more)
- http://johnguerra.co/viz/influentials/RStudioConf2018/ (asks “who’s influential?” You can guess who, but it’s always interesting to look at the data.)
When I looked at all the #rstudioconf tweets I had downloaded, I found many gems that contained URLs and had been retweeted 10 times or more. (Here are the retrieval details.)
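For the curious, here is a minimal sketch of that retrieval. It assumes the rtweet package, a stored Twitter token, and rtweet’s column names (retweet_count and the urls_url list-column); treat it as an illustration rather than the exact script I ran.

```r
# A sketch only: assumes the rtweet package and a stored Twitter token.
library(rtweet)
library(dplyr)
library(purrr)

# Pull recent tweets carrying the conference hashtag (original tweets only).
conf_tweets <- search_tweets("#rstudioconf", n = 18000, include_rts = FALSE)

# Keep the "gems": tweets that link somewhere and were retweeted 10+ times.
gems <- conf_tweets %>%
  filter(retweet_count >= 10,
         map_lgl(urls_url, ~ any(!is.na(.x)))) %>%
  arrange(desc(retweet_count))
```

Here are some examples of people participating in the R organism “computing on itself” at rstudio::conf 2018: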
Import: Open, welcome. Importing data can be hard work, right? Welcoming new people and bringing in new ideas is often not recognized as the important and challenging work that it is. Mara Averick performs a valuable service on Twitter as @dataandme. Her talk at the conference about contributing to the tidyverse unpacked the process of entering the organism and becoming a contributor to it. Marco Blume talked about using DataCamp to develop R skills and data literacy in everyone at his company who wanted to skill up. But he was also bringing a new idea into the R organism’s conversation with itself: what does a social transformation around data literacy look like? What makes it happen? How do you assess progress?
Tidy: Configure. Making your data tidy can be a big undertaking. “Tidy” is more or less obvious when you get there, but can be very challenging at the beginning of the process. In his keynote about deep learning and TensorFlow in R, JJ Allaire took a huge step toward making TensorFlow look like a native part of R. His talk included a set of tools, a gallery, and a book, representing a massive effort to bring a whole new domain into “normal R.” And in remarkable “no bullshit” fashion, he mentioned a recent paper that casts an interesting shadow of doubt over every data scientist’s valiant effort to “clean the data”: “Scalable and accurate deep learning for electronic health records” suggests that at large scale, “dirty but complete” data may have more useful predictive value than data that have been “reduced by intelligent cleaning”.
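That “native R” feel is easy to demonstrate. Here is a minimal sketch, assuming the keras package with a TensorFlow backend (the layer sizes are arbitrary, chosen only for illustration):

```r
# A sketch only: assumes the keras R package with a TensorFlow backend.
library(keras)

# Pipes and plain function calls make a TensorFlow model read like ordinary R.
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(10)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)
```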
Transform: Reinterpret. Di Cook explored how to take “traditional, ubiquitous tools” like the tidyverse and ggplot2 and connect them with another set of ubiquitous tools: randomization and replication. She argued that graphs are just the result of calculations on data, so evaluating that output requires rigorous methods and more calculation. Here are her slides. And you can use and build on a package (that she maintains) to do it yourself. As if to remind us of the iterative nature of transformation and reinterpretation, Carson Sievert discussed graphs for exploration built on a JavaScript library, referring to his book and R package, which sit more or less at the opposite end of the spectrum from Di Cook’s talk.
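Back to Di Cook’s idea: the lineup protocol she described can be sketched in a few lines. This assumes the nullabor package, one implementation of the protocol; the real plot hides among plots of permuted data, and if you can pick it out, the pattern is probably not noise.

```r
# A sketch only: assumes the nullabor package (and ggplot2).
library(nullabor)
library(ggplot2)

# Embed the real data in a lineup of 20 panels; the others permute mpg.
d <- lineup(null_permute("mpg"), mtcars)

# If the real panel stands out from the nulls, the pattern is significant.
ggplot(d, aes(mpg, wt)) +
  geom_point() +
  facet_wrap(~ .sample)
```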
Visualize: Contextualize, metabolize. Jenny Bryan’s workshop on “What They Forgot to Teach You About R” was really about helping you visualize how you work and think through how to make your work processes more orderly and rational. Rethinking a common tool like GitHub and adapting it to a data-analysis use case is a work in progress. Jessica Minnier’s workshop notes were great! I was really surprised to notice how often I just jump outside of R/RStudio to do things in the course of a project, and how much less reproducible that makes my work. Of course there’s a book about the GitHub part of the data analyst’s workflow.
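As a reminder to myself, here is a sketch of staying inside R for the chores I usually shell out for. It assumes the usethis and fs packages and is run from inside an existing project; the paths are hypothetical.

```r
# A sketch only: assumes usethis and fs, run inside an existing project.
library(usethis)
library(fs)

use_git()        # put the current project under version control, no terminal
use_readme_md()  # start a README without leaving R

dir_create("data-raw")               # mkdir, the fs way
file_copy("~/Downloads/tweets.csv",  # cp, the fs way (hypothetical path)
          "data-raw/tweets.csv")
```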
Model: Simplify, standardize. Modeling, to use John Tukey’s words, is separating a fit from the residuals. The R community makes a great effort to name things well and (particularly in the tidyverse) to keep some consistency in package APIs. Nicholas Tierney’s ePoster on his naniar package for profiling missing data is a perfect example of taking a messy subject and wrapping it up in a neat package with a neat name, a good joke about Narnia thrown in. Jim Hester’s talk “You can make your own package in 20 minutes” pares the process down to the bare minimum, so that we can standardize code and be free to go through the Transform, Visualize, Model cycle again.
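The neatness shows in how little code a missingness profile takes. A quick sketch, assuming the naniar package (airquality ships with R and has missing values):

```r
# A sketch only: assumes the naniar package; airquality ships with R.
library(naniar)

miss_var_summary(airquality)  # count and rank missing values per variable
gg_miss_var(airquality)       # plot missingness by variable, ggplot2-style
```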
Communicate: Share, continue. The emphasis on that “last mile” of communication and sharing results is something remarkable about the R organism. It has always given me great confidence, and it was much in evidence at the RStudio conference. One great example of that was Yihui Xie’s “Creating Websites with R Markdown and blogdown” (a sketch of getting started follows this paragraph). Of course, having enough spare compute power to crack a good joke every other slide is also personally inspiring to me. Petr Simecek’s collection of all the conference slides organizes the whole thing and sets the ground for the next cycle of computing on R. I already got a ticket for next year’s conference.
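About that blogdown talk: getting started really is a short loop. A minimal sketch, assuming the blogdown package and a Hugo installation:

```r
# A sketch only: assumes the blogdown package and Hugo.
library(blogdown)

new_site()                                # scaffold a Hugo site in the project
new_post("Computing on R", ext = ".Rmd")  # draft a post in R Markdown
serve_site()                              # preview locally with live reload
```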
Recommended reading: Cesar Hidalgo. Why Information Grows: The Evolution of Order, from Atoms to Economies. Basic Books, 2015. http://isbn.nu/9780465048991