Feb 06 2017

Dogged pursuit of data quality and use

Published at 11:49 pm under pdxdata

Data Dogs notes – Jan 2, 2017 – https://www.meetup.com/Portland-Data-Science-Group/events/236078570/

Challenge question: describe three measurements that you are familiar with at each of these three levels: 1) personal, 2) very large scale, and 3) in between.  For each level describe what’s measured, why, how, and who does the measurement.  How does it all add up to good or bad measures?

Group 1 report-out

Personal examples:

  • Observing the health of a great aunt
  • Health data such as using a fitbit and syncing it with a phone
  • Tracking blood donations, keeping track of which arm is used. Noticing how being lazy about collection enters into it.

Large scale:  

  • Google in general.  
  • Noticing how adverts follow us around.  Avoiding them by using a VPN or going incognito.  All of it makes paranoia more admirable.

In between:

  • Have a customer that generates lots of data: so much that paper doesn’t work anymore. Trying an electronic version.  Find that clients collect data but don’t analyze it, don’t want to hear conclusions.  They really want to hear “their story” reflected.
  • Similar to “egoless programming,” it’s a challenge to disassociate yourself from the product so that you can look at the data.
  • Example from the Kim Kaners paper: not collecting metrics because data can be used against you.  
  • Quote from a client: “You can’t trust the data because things have improved [since it was collected].”  If the business is going well, maybe you don’t need to improve?
  • Which is most frustrating: getting clear, clean data or getting clients to listen or use the data?
  • Like web server statistics.  You can control what’s on the website, how to advertise, what’s to be measured, and where to allow advertisements.

Group 2 report out

  • A difficult topic: “Usually people have some role in measurement. So psychology always has an impact. Feelings matter in measurement.”
  • Another topic: accuracy vs precision.  Measure something vs estimate.  Is every measurement a bit of an estimate?
  • Gripe about data we depend on but where we can’t influence measurement methods: people don’t reveal their methodology, even though it’s important.  
  • Spent time on the last point: formalized process, change methods, be more precise, summarize.  Subjective data.  Tracking assessments of hires.  Categorical data.  Journal entries.  Different people using different codes.  Severity codes.  What does a 4 severity mean?
  • When you formalize the process, it seems like you spend a lot of time on the measurement.  
  • GitPrime.com: “Data-driven visibility for modern software teams.” A new githost company that sells you statistics about your developers.  Measuring programming productivity.  “Software teams have never had a shred of hard data to bring into discussions. Forward-thinking organizations are moving beyond subjective measures like tickets-cleared, and evaluating their work patterns based on concrete data. Knowing immediately when an engineer is stuck or spinning their wheels helps managers do their jobs in a way that has never been possible. Measuring things like churn, codebase impact, and true cost of paying down technical debt, allows engineering to demonstrate how they’re meeting the business objectives at hand.”  cdn2.hubspot.net/hubfs/2494207/Content/whitepaper/GP-DataDrivenTeams.pdf
  • Goodhart’s law: “when a measure becomes a target, it ceases to be a good measure.”
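
The accuracy vs precision distinction raised above can be made concrete with a small sketch. Everything here is illustrative rather than from the discussion: the readings and the true value are made-up numbers, and `accuracy_and_precision` is a hypothetical helper. It assumes accuracy can be read as the distance of the average reading from the true value (bias), and precision as the spread of repeated readings.

```python
import statistics


def accuracy_and_precision(readings, true_value):
    """Return (bias, spread) for repeated readings of one quantity.

    bias   = mean reading minus true value (closer to 0 = more accurate)
    spread = sample standard deviation (smaller = more precise)
    """
    bias = statistics.mean(readings) - true_value
    spread = statistics.stdev(readings)
    return bias, spread


# Hypothetical bathroom-scale readings for an assumed true weight of 70.0 kg.
TRUE_WEIGHT = 70.0
tight_but_off = [72.1, 72.0, 72.2, 71.9, 72.0]        # precise, not accurate
loose_but_centered = [68.0, 72.5, 69.5, 71.0, 69.0]   # accurate, not precise

for name, readings in [("tight_but_off", tight_but_off),
                       ("loose_but_centered", loose_but_centered)]:
    bias, spread = accuracy_and_precision(readings, TRUE_WEIGHT)
    print(f"{name}: bias={bias:+.2f} kg, spread={spread:.2f} kg")
```

In practice the true value is rarely known, which is one way of seeing why every measurement is a bit of an estimate.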

Group 3 report out


Personal:

  • Body weight is a big one.  
  • Getting things done. Todo lists.  Tracked or not.  

Large scale:

  • Statistics around gun violence.  How to get the data and understand it.  Measurement problems: there are laws against measurement!   Scraping the net for data.  Washington Post efforts to collect data about gun violence.  
  • Data about electronic components not shared.

In Between:

  • Own data science project: sample size. Counting criteria. What is measured affects accuracy and calculations.  More samples mean more accuracy but also more time gathering data.  Facing the tradeoffs in gathering cell counts & brain samples. Counting standards. A project on dementia.  Looking through a microscope.  Cells picked out.  Candy turtles.  
  • Trying to get out of manually measuring (or manually entering data about) anything, mechanising all data collection.  Practices vary by industry.  In the electronics industry the unit is “defects per million”.  Different in a new ERP system.  
  • Economic incentive to do as little manual entry as possible: it takes time and money.
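
Two of the points above lend themselves to a quick calculation. The sketch below rests on assumptions not stated in the discussion: "accuracy" is approximated by the standard error of a sample mean with independent samples, and "defects per million" is taken in its common sense of defects divided by units, scaled to one million. The function names and the example numbers are hypothetical.

```python
import math


def standard_error(sample_std, n):
    """Standard error of a sample mean, assuming n independent samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sample_std / math.sqrt(n)


def defects_per_million(defects, units):
    """Defect rate scaled to one million units."""
    if units <= 0:
        raise ValueError("units must be positive")
    return defects * 1_000_000 / units


# Made-up spread of 10.0 in the counts: each 4x increase in samples
# only halves the standard error, while collection time grows roughly 4x.
for n in (25, 100, 400):
    print(f"n={n:4d}  standard error={standard_error(10.0, n):.2f}")

# Made-up production run: 3 defects found in 20,000 units.
print(f"DPM={defects_per_million(3, 20_000):.1f}")
```

The square root is the tradeoff in a nutshell: the cost of gathering samples grows linearly while the gain in accuracy flattens out.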

Group 4 report out


  • Measuring the amount of time spent executing a job during the day.  Getting at “what’s personal efficiency?” How to measure it? Difficulties and vagaries of measurement.  Have to remember to measure it throughout the day.  Bottom line: billing clients with it.  
  • Anecdotal (“I had a bad Tuesday”) vs. data-driven decisions.
  • When business doubled, who is wondering about capacity in the factory?


  • How businesses handle item/master data is frequently messy.  People use a field in the system for very different purposes than what was intended or for what others use it.  More communication can lead to more standardization: agreeing on what a field “means” and how you are supposed to use it.  Variant uses of a field can also be clever.
  • An example of data variability reduction: Subway, Inc. studied supplier data provided by its franchisees.  They found that 80% of supplier data was inaccurate across the whole supply chain.  Now they get purchase amount data from suppliers and push it to the individual franchisees to standardize reporting about franchise performance.  

We had a general discussion about assessing causation when direct experimentation is impossible using the “Hill Criteria” that Roger Peng mentioned in a recent podcast: https://en.wikipedia.org/wiki/Bradford_Hill_criteria

Thomas E. Kida, Don’t Believe Everything You Think: The 6 Basic Mistakes We Make in Thinking

  • We prefer stories to statistics
  • We seek to confirm
  • We rarely appreciate the role of chance and coincidence in life
  • We can misperceive our world
  • We oversimplify
  • We have faulty memories

