Tag Archives: statistics
Chris Bartley worked on land mine removal as part of his undergraduate work, applying data collection and statistics to help with the process of removal. While on a trip for his research he contracted Reiter’s Syndrome. Even after he recovered he still felt like something was wrong. After consulting a physician he started tracking his wellness along with his diet and supplement intake. What follows is an amazing story about what Chris learned when he started applying his knowledge of statistics to his own data.
We’ll be posting videos from our 2013 Global Conference during the next few months. If you’d like to see talks like this in person, we invite you to join us in Amsterdam for our 2014 Quantified Self Europe Conference on May 10 and 11.
I recently came across four contests relevant to the QS community, and wanted to pass them along.
1. data in sight: making the transparent visual
This is a hands-on data visualization competition held June 25th and 26th, 2011, at the Adobe Systems, Inc. offices in San Francisco’s SoMa District. Open to coders, programmers, developers, designers, scientists, members of the media—anyone who believes that data is divine and has ideas for bringing it to life. Data sets will be provided, or bring your own. (Thanks to Indhira Rojas for sending this in!)
2. Health 2.0 Developer Challenge: Washington, DC Code-a-thon
On June 11, 2011, developers, designers and other stakeholders will be given an overview of health care issues, tools and data sets, and asked to creatively design new tools for the health care space. Developers are encouraged to use OpenGov data sets as well as private data sets to create their application. At the end of the day, developers present their application to the group, and the best solution is awarded.
3. CureTogether Health Data Discovery Contest
Over the past 3 years, CureTogether has gathered millions of patient-reported data points on symptoms and treatments for over 500 conditions. But on a larger scale, how well does CureTogether data represent the general population? In this contest, stats-minded people are asked to challenge the dataset and see whether or not it holds up to existing research studies. There are cash prizes, and the deadline for joining the contest is June 29, 2011.
4. sanofi-aventis U.S. Innovation Challenge: Data, Design, Diabetes
Starting July 1, 2011, innovators can submit their best data-inspired and human-centered concepts for people living with diabetes. 5 semi-finalists will receive $20,000 and professional mentoring to develop a working prototype. Following a demo day, 2 finalists will be selected to receive an additional $10,000 to test their solution in a real-life diabetes community. The final winner will receive $100,000 and a month-long stay at the RockHealth incubator in San Francisco to turn their prototype into a scalable solution for people living with diabetes. (Thanks to Steve Dean for sending this in!)
Good luck! If you know of any other QS-related contests, please leave a comment.
Identifying patterns is crucial in experimentation because patterns can indicate useful correlations. After all, the whole point of experimenting on ourselves and collecting data is to find ways to make changes that help us to be happier, and patterns tell us where there are points of leverage. Patterns should make us curious, and we should pay attention to them.
When analyzing our results both during and after a self-tracking effort, how do we find patterns? Unfortunately the answer is not straightforward. On the one hand our brains have evolved marvelous pattern recognition facilities that connect the dots and make meaning out of what we see in nature. In fact, human vision and verbal language are still far beyond what computer researchers can write programs to do. And our system is programmable. Just try this trick: Look around where you’re sitting and quickly scan the room. Now think about the color red and re-examine the room. Notice how everything red jumps out. Similarly, we’re easily able to recognize a face or voice in a crowd when we filter for it.
However, our ability to identify patterns is fraught with errors. This is because evolution has conservatively favored finding patterns when they don’t exist, rather than the opposite, which would hamper survival. So although we can easily find patterns, it is very possible that the two things we think are connected are not. Finding the face of the Virgin Mary on a grilled cheese sandwich is a fine example, as is falsely hearing the phone ring in the shower, hearing hidden words in a song played backwards, and seeing shapes in ink spots.
So where does this leave us when it comes to making sense of our personal data? I can think of two approaches. The first one (innate human reasoning) is readily available and generates intuitively satisfying results, but is prone to inaccuracy. The second approach (statistics) is far more accurate but more complex to apply. In fact, the field of statistics was created exactly because our human ability is so notoriously poor.
However, I think it’s still possible to find useful patterns even if we’re not good at it, and even if we don’t have the knowledge to apply sophisticated statistical models (I certainly don’t). So how do we reconcile that with our limitations? What I’d love to have is a handbook of strategies for discovering patterns in our self-tracked data. Though I’m ill-prepared for creating one, let me naively toss out some general ideas and see what you think.
How to look for patterns
As I described above, we must first be clear with ourselves that any connections we tentatively find within our data are quite possibly invalid. This means we need to question them, ask whether we have enough of the right kind of data, and continue to test them if necessary. In other words, we need to be good scientists.
Generally our goal is to find connections in the data and to look for cause and effect. To do this we look at each variable we measured and ask how it might relate to others. For personal experiments, two general kinds of causality are temporal (something happens followed consistently by something else happening, such as “My mood goes south the day after I drink alcohol”) and spatial (something that consistently takes place at a certain location or set of circumstances, such as “I feel happy when I’m riding my bike.”).
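To make the temporal case concrete, here is a minimal Python sketch of the alcohol-and-mood example. The log, its layout, and the function name mood_by_previous_day are all hypothetical, and the sketch assumes one entry per consecutive day. It simply compares mood on days that follow a drinking day with mood on all other days.

```python
from statistics import mean

# Hypothetical daily log: (day index, drank alcohol?, mood on a 1-10 scale).
# Assumes one entry per consecutive day, with no gaps.
log = [
    (1, False, 7),
    (2, True, 7),
    (3, False, 4),
    (4, False, 7),
    (5, True, 6),
    (6, False, 5),
    (7, False, 8),
    (8, True, 7),
    (9, False, 4),
    (10, False, 7),
]

def mood_by_previous_day(entries):
    """Split each day's mood by whether alcohol was logged the day before."""
    after_drinking, after_none = [], []
    for yesterday, today in zip(entries, entries[1:]):
        drank_yesterday = yesterday[1]
        mood_today = today[2]
        if drank_yesterday:
            after_drinking.append(mood_today)
        else:
            after_none.append(mood_today)
    return after_drinking, after_none

after_drinking, after_none = mood_by_previous_day(log)
print("mean mood the day after drinking:", round(mean(after_drinking), 2))
print("mean mood on other days:", round(mean(after_none), 2))
```

A gap between the two averages in ten days of made-up data proves nothing by itself; it only suggests a connection worth questioning and testing further.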
Sometimes changing the data around makes things more evident, so it can be helpful to use a visualization tool. Even the charts in your spreadsheet program might be useful. At the highest level we might reorder the data (say by quantity or magnitude), look for sequences (sorting by time), or look for repetitions or groupings. You can get more specific with your particular data. Injecting a big dose of creativity is key.
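If your data lives in a spreadsheet export, the same three moves (reorder by magnitude, sort by time, group) take only a few lines of code. The sketch below uses invented rows and column meanings purely for illustration; none of it comes from a real data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (ISO date, weekday, hours slept, mood on a 1-10 scale).
rows = [
    ("2011-06-11", "Sat", 8.5, 8),
    ("2011-06-06", "Mon", 6.0, 5),
    ("2011-06-18", "Sat", 8.0, 9),
    ("2011-06-08", "Wed", 5.5, 4),
    ("2011-06-13", "Mon", 6.5, 5),
    ("2011-06-07", "Tue", 7.5, 7),
]

# Reorder by magnitude: longest nights of sleep first.
by_sleep = sorted(rows, key=lambda r: r[2], reverse=True)

# Look for sequences: ISO dates sort chronologically as plain strings.
by_date = sorted(rows, key=lambda r: r[0])

# Look for groupings: collect mood scores per weekday, then average them.
moods = defaultdict(list)
for _, weekday, _, mood in rows:
    moods[weekday].append(mood)
mean_mood = {day: mean(scores) for day, scores in moods.items()}

print("longest sleep:", by_sleep[0])
print("first entry in time order:", by_date[0])
print("mean mood by weekday:", mean_mood)
```

Each view is cheap to produce, so it is worth trying several and seeing which one makes something stand out.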
Factors to consider
Following is a Socratic-inspired set of questions that a friend might ask if you brought your data to her.
- Where were you?
- Where were you going?
- What was going on around you?
- What were you doing?
- What events were taking place?
- What did you do before you got here?
- What changed right before? Right after?
- Who were you with?
- What interactions were you having?
- What was your mental/psychological state?
- What were you thinking?
- What was your physical state?
If you can’t find useful patterns
If you are experiencing a lack of patterns, then there are two possibilities. Either 1) the pattern is there but you’re not seeing it, or 2) the data you’ve collected doesn’t manifest any patterns. To address the first problem you should bring a fresh pair of eyes to the data set. Often we become too close to the data and get stuck looking at it one way; others might see it differently.
For the case where both you and your collaborators are convinced there are no patterns evident, you need to go back to the experiment’s design and change the data you measure. We’ll save for another time the details on how to create a design, including selecting variables, but beware of throwing out the entire set of variables and starting over from scratch. Consider whether there is one more piece of information that might give you some insight, or one small change you could make in what you’re already tracking. Alternatively you might decide that you’ve tapped the experiment for all it’s worth, and move on to more fertile ground.
The joy of patterns
Finally, not only are they potentially helpful, patterns can also give us joy. After all, art gives us pleasure through the beautiful patterns artists create: glowing colors in a painting, rhythm in music, and the tactile pleasure of a fine weave. By being on the lookout for lovely patterns in your world you can bring yourself into the moment and appreciate the life around you.
What do you think?
I would love to hear your thoughts on this. What strategies do you use when finding patterns? How successful have they been? Is there a general set of heuristics for finding patterns in arbitrary data?
[Image from Joana Roja]
(Matt is a terminally curious ex-NASA engineer and avid self-experimenter. His projects include developing the Think, Try, Learn philosophy, creating the Edison experimenter’s journal, and writing at his blog, The Experiment-Driven Life. Give him a holler at email@example.com)
Jeffrey Heer, a Stanford professor in Human-Computer Interaction and Quantified Self advisor on data visualization, and his colleagues Mike Bostock and Vadim Ogievetsky have put together a terrific guide to the various kinds of data visualization, and when and how to use each one.
They call their guide A Tour through the Visualization Zoo:
“In 2010 alone we will generate 1,200 exabytes–60 million times the content of the Library of Congress. Within this deluge of data lies a wealth of valuable information on how we conduct our businesses, governments, and personal lives. To put the information to good use, we must find ways to explore, relate, and communicate the data meaningfully…
Well-designed visual representations can replace cognitive calculations with simple perceptual inferences and improve comprehension, memory, and decision making. By making data more accessible and appealing, visual representations may also help engage more diverse audiences in exploration and analysis…
Creating a visualization requires a number of nuanced judgments. One must determine which questions to ask, identify the appropriate data, and select effective visual encodings to map data values to graphical features such as position, size, shape, and color.”
Stops along the tour include Time-Series Data, Statistical Distributions, Maps, Hierarchies, and Networks. Each one is broken down into subtypes, with helpful examples that can be applied to your own dataset.
The authors end with a challenge:
“As you leave the zoo and head back into the wild, try deconstructing the various visualizations crossing your path. Perhaps you can design a more effective display?”
If you make a change to your daily routine or try a new medication, how do you know if it is working?
And if you have a question about your self-tracking for our advisors, let me know.
Bard’s Symptom Tracking Experiment:
My purpose is to correlate whether taking a particular medication helps to alleviate specific symptoms.
Medications and symptoms are tracked in a Google doc, http://spreadsheets.google.
My question is this: imagine 2 columns, A representing whether medication was taken (0 or 1) and B measuring some relevant symptom (value x). The median is 99 in column B next to any 1’s in column A, and 100 in column B next to any 0’s in column A. The standard deviation is 1 (or any SD really; I just wanted to find a formula that could still find a high correlation with such a small deviation, which shouldn’t be hard considering the huge amount of data that I have). This simulates a pill that works, on average, to decrease the symptoms by 1 point. Not a huge change, but extremely consistent, so worth identifying.
I would like a formula that returns the “strength” of the correlation, which in this example is approx. 100%, given a large enough data set. Any help would be greatly appreciated.
Neil Rubens’ Answer:
If I understood your questions correctly you may consider using the following two approaches to analyze your data.
1. There are many different ways of measuring dependence between variables (besides correlation); this wiki link on “dependence measurement” should provide a good place to start.
2. You can use a “statistical hypothesis test” to establish whether the difference between treatments is statistically significant (even if this difference is very small), that is, unlikely to have occurred by chance.
I hope this at least partially answers your questions.
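As one illustration of a dependence measure that works with a 0/1 column, the sketch below computes the ordinary Pearson correlation between the medication flag and the symptom score (often called the point-biserial correlation when one variable is binary). Everything in it is an assumption made for illustration: the data are simulated rather than taken from Bard's spreadsheet, the noise is Gaussian so that mean and median coincide, and the two groups are the same size.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Simulated stand-in for the two spreadsheet columns (hypothetical data).
rng = random.Random(0)
n_per_group = 2000
taken = [1] * n_per_group + [0] * n_per_group          # column A
symptom = (
    [rng.gauss(99, 1) for _ in range(n_per_group)]     # column B, medication taken
    + [rng.gauss(100, 1) for _ in range(n_per_group)]  # column B, no medication
)

r = pearson_r(taken, symptom)
print(f"point-biserial correlation: {r:.3f}")
```

Under these assumptions the coefficient comes out around -0.45 rather than near 100%, and collecting more rows does not push it higher. A one-point shift against a one-point standard deviation leaves a lot of overlap between the groups. What more data does improve is the certainty that the shift is real, which is what the hypothesis test in point 2 measures.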
David Goldberg and Teresa Lunt’s answer:
I’m not sure correlation is the best way to think about this, since one of the variables (the A column in your notation) takes on only two values, either 0 or 1.
It might make more sense to consider the two sets of symptom values, S1 for subjects who didn’t take the medication, and S2 for those who did. Then you can use the tests developed for comparing two sets of numbers. Here are three common tests.
1. The t-test. Using the free ‘R’ statistical package on this data (where x=S1, y=S2) gives:
Welch Two Sample t-test
data: x and y
t = 4.105, df = 797.703, p-value = 4.459e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
mean of x mean of y
This not only says there is a statistically significant difference (p = .00004), but tells you that with 95% confidence, the difference in means between the two sets is between 0.5 and 1.4. In other words, the symptom value in the control (no medication) group is likely to be at least 0.5 more than in the experimental group.
2. Another possibility is the Wilcoxon rank-sum test. If you think the symptom values are nowhere near having a Gaussian distribution, then this would be more appropriate. For your data (again using ‘R’):
Wilcoxon rank sum test with continuity correction
data: x and y
W = 93025.5, p-value = 6.297e-05
alternative hypothesis: true location shift is not equal to 0
Again the test shows the two sets are unequal, since p is so small (p = 0.00006). However, you don’t get the confidence interval for the difference of means.
3. If the data aren’t Gaussian and you want the confidence interval for the difference in means, consider using the bootstrap.
> fn = function()
+ mean(sample(x, length(x), replace=TRUE)) - mean(sample(y, length(y), replace=TRUE))
This gives a 95% confidence interval similar to that of the t-test: (.47, 1.4) vs (.49, 1.4)
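For readers who don't use R, the bootstrap in point 3 can be sketched in Python with only the standard library. This is one common variant (the percentile bootstrap) and not necessarily the exact procedure used above. The function name bootstrap_mean_diff_ci, the number of resamples, and the simulated data are all assumptions, so the printed interval will not match the numbers quoted in the answer.

```python
import random
from statistics import fmean, quantiles

def bootstrap_mean_diff_ci(x, y, n_resamples=2000, seed=0):
    """Percentile-bootstrap 95% interval for mean(x) - mean(y)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        x_star = rng.choices(x, k=len(x))
        y_star = rng.choices(y, k=len(y))
        diffs.append(fmean(x_star) - fmean(y_star))
    # Cutting into 40 equal parts puts the first and last cut points
    # at the 2.5th and 97.5th percentiles.
    cuts = quantiles(diffs, n=40)
    return cuts[0], cuts[-1]

# Simulated stand-in data, NOT the values from Bard's spreadsheet.
data_rng = random.Random(1)
x = [data_rng.gauss(100, 1) for _ in range(400)]  # no medication
y = [data_rng.gauss(99, 1) for _ in range(400)]   # medication taken

low, high = bootstrap_mean_diff_ci(x, y)
print(f"95% bootstrap CI for the difference in means: ({low:.2f}, {high:.2f})")
```

The appeal of this approach is that it makes no assumption about the shape of the symptom distribution; it only reuses the observed values many times over.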
Thanks to Bard for the question and to Neil, David and Teresa for their answers! Brilliant and experimenting QS readers, please send in your questions and we’ll do our best to find answers for you.
His list includes blogs on statistics, visualizations, maps, design, and “others worth noting,” a category that includes our own Quantified Self blog. Thanks Nathan! (Adding Nathan’s blog to the list makes 38.)
This free, charming, how-to site tells you simply and frankly how to do your own statistical analysis, complete with an odd-sweater-wearing mathematician to take you through the necessary Excel menus. The ultimate in geeky how-to for anybody who wants to be able to ask some basic analytical questions for themselves. Why take it on faith? Grab those data sets and start calculating. Completely non-condescending, but includes some dialog with a sock puppet.