Tag Archives: open data

The QS World I Would Like to Live In

In the QS world I’d like to live in, our personal data would be easily available to us to learn from using many different methods and tools. Here are some conditions I think would make this easier:

  1. Data can be exported from the various systems we use into a simple format for exploration.
  2. We can store and back up our data using whatever method we want.
  3. We can share our data with whomever we want.
  4. We can rescind permission to look at our data.
  5. We can flow our data into diverse visualization templates and analytical systems.

I’ve tried to express these conditions briefly and simply, but any of them – and certainly all of them together – require changes in the systems we currently use, and these changes may be challenging for technical, business, social, and political reasons.

I know many people in our community have worked on parts of this problem, and I’m interested in your comments and ideas.

Posted in Discussions, Lab Notes | 11 Comments

Toolmaker Talks: Bastian Greshake (openSNP)

We talk very frequently here on the QS website about tools, methods, and systems that help us understand ourselves. When it comes to the self there may be nothing more fundamental to understanding our objective ourness than our basic genetic makeup. Many of you have probably undergone, or have thought about undergoing, Direct-To-Consumer genetic testing to better understand your phenotypes, disease risk, or even your ancestry. That’s all great, and I’ve spent a lot of valuable time combing through my own genetic data, but as with most data, the true power lies in large datasets that provide observations across many individuals. So how do you participate in that type of sharing and learning? Enter the team at openSNP.org. Today we talk with Bastian Greshake, one of the developers behind the openSNP project.

How do you describe openSNP? What is it?

The too long, didn’t read version: An open platform that allows people to share their genetic information along with traits suspected to be at least partially genetically predisposed, and that also tries to annotate those genetic variants with primary scientific literature. The data can be exported from openSNP through the website or through APIs, making it easy to re-use the data.

A longer version: openSNP basically has two target groups, and users may well fit into both categories.
First there are customers of Direct-To-Consumer (DTC) genetic testing like 23andMe who want to share their genetic information with the public for various reasons. They can use openSNP to release their genetic data into the public domain using the Creative Commons license which is applied to the data uploaded and entered in openSNP.
As genetic information on its own is interesting but not very useful for analyzing the effect of genetic variants on bodily traits, those users can also enter information about traits which might be genetically influenced and create new categories which all other users can then fill in. Those traits range from the more obvious ones, like eye and hair color, to more exotic ones like political ideology. A few weeks ago we also created a method for users to connect their Fitbit accounts to openSNP to make the collection of data easier and more standardized. The genetic effects on activity, sleep habits and weight loss/gain can more easily be analyzed in this fashion.

We also mine the databases of Mendeley, the Public Library of Science and the SNPedia to annotate the genetic variants users carry. This allows customers of DTC testing to find out what the recent scientific literature is able to tell them about their genetic variants. While the SNPedia is a crowd-curated Wiki, Mendeley and the Public Library of Science link back to primary literature, in the latter case even to Open Access literature whose full text is available to everyone.

The second group of users who are interested in openSNP are scientists and citizen scientists who are interested in using the data for their own studies, be it to figure out what genetics can tell us about our ancestry or which effects single variants have on disease risks or other traits. The data can be downloaded from openSNP in bulk or more granularly accessed through a JSON-API and the Distributed Annotation System, a standard in Bioinformatics, which for example is used to visualize the data.

Both groups can profit from the commenting features, which allow users to communicate about traits and individual genetic variants. The internal message system of openSNP also facilitates further communication, for example to share details about shared traits and diseases or to allow people who want to use the data to get back in touch with the people who uploaded the data. The latter enables direct exchange between those two user groups in both directions: Researchers can ask questions about traits, and people who have shared their data have a back channel as well and can get notified about the results researchers have found.

What’s the backstory? What led to it?

It more or less began with me getting my genetic information analyzed by 23andMe myself. After I received the results I published the data in a git repository on GitHub to make it available for others who might benefit from having more data. As I started to dig deeper into my own results and the raw data I wanted to have more data sets myself, to be able to compare the results. But unfortunately there wasn’t a single resource for such data. Some people had also published their data on GitHub, others on their own websites, collected publicly available data sets in a Google Spreadsheet or participated in projects like the Personal Genome Project.

This was quite frustrating: Finding the data was hard, and most often there was no additional data about traits attached. And more often than one would expect there was also no way to contact the people who made the data public. So the idea to create a platform to solve this problem grew and I contacted some friends to see if they were interested in building such a platform, just for fun. We started out with the basic idea of creating a platform where people could upload their genetic data along with some traits they have. A couple of weeks after we started to work on the project we stumbled upon the APIs of Mendeley & the Public Library of Science and thought it might be cool to include additional data about the genetic variants as well. During the development we came up with more and more features, like the openSNP APIs. All in all the project is still growing and we’re working on adding and refining features.

What impact has it had? What have you heard from users?

We submitted the first release of openSNP to the 2011 PLOS/Mendeley Binary Battle, a competition looking for creative ways to use their APIs, and won first prize. We also secured a small grant from the German Wikimedia Foundation, which allowed us to genotype over 20 people, mainly from underrepresented groups, to diversify the available data. Those persons have now released their genetic data on openSNP as well. Right now we have over 250 genetic data sets on openSNP and just short of 600 registered users. Those numbers don’t sound too impressive in the age of one billion people on Facebook. But to put it into perspective: Genetic testing is still a niche thing, and before openSNP was released there were about 40-50 of those data sets publicly available.

The feedback from our users has been very positive. Many users come up with new ideas for features they would like to see added, and we are really open to those suggestions and critiques. Many of the API methods which are now implemented (and the whole Distributed Annotation System) are only in place because users let us know they wanted them. I know of users who are actively using openSNP to learn more about their test results and are in an active exchange with other users with similar traits. And while the amount of data we have so far doesn’t really allow scientifically sound studies, there are already people using the data. For example, there are users who run their self-written analysis tools over the openSNP data sets and report the results back to the users, which is amazing.

What makes it different, sets it apart?

Of course we’re not really the first to think of such an idea but are more or less a remix. For example, 23andMe themselves do use the data of consenting customers for studies. They also provide questionnaires about traits which users can take. But this data isn’t available to the public, due to (perfectly reasonable) concerns in terms of privacy, bio-ethics and liability. On the other hand there are projects like the Personal Genome Project, which publishes traits and genetic data of participants into the public domain. But for reasons similar to those of 23andMe, participation in the project isn’t open to everyone.

We feel that informed individuals should be in the position to share their data with the world in an easy fashion, like they are already doing on their own websites. And of course we’re targeting a slightly different group: Probably over 150,000 people are customers of some DTC genetic testing. This is a huge potential data source which could be used to help us understand new and exciting things.

What are you doing next? How do you see openSNP evolving?

We’re still developing and refining openSNP. One of the biggest problems right now is the quality of the data for the additional traits. We have kept the process of adding data really open on purpose, to make it easy for people to provide additional information about themselves. Unfortunately this has the side effect that the quality of the descriptions varies wildly. Those problems start off with regional idiosyncrasies: Is it “Eye Color” or “Eye Colour” and are you using the metric or the imperial system of units? And is your eye color blue or “Indeterminate brown-green with a subtle grey caste”? This granular data can be very useful, but for many applications it can be too specific. With the implementation of the Fitbit API we’ve taken a first step to keep the entering of data simple but unified at the same time. And we’re currently looking into other ways to counter problems like this.

We’re also looking into more data sources to annotate the genetic variants listed in openSNP, to provide even more information for customers of DTC testing. And we’re also working on making our APIs more powerful. With the rOpenSci package there is already a great library which makes use of the APIs in their current state, but of course we would like to see more of those libraries.

And it’s hard to say in which direction openSNP will evolve, as we are a bit dependent on the DTC genetic testing industry. More and more data, like Whole-Genome or Exome Sequencing, is generated and we are working on reflecting those changes on openSNP as well. And we’re open to any suggestions. So if you find that a feature is missing you should let us know, and we will try to work out a way to implement it usefully.

Anything else you’d like to say?

First of all: We know genetic information is sensitive, and depending on where you are living there might not even be laws to protect you from discrimination based on your genes. Other countries, like the US with the Genetic Information Nondiscrimination Act (GINA), have some mechanisms against this, but even those might not offer total protection in the end. And you should also keep in mind that your genetic information does not only give away details about yourself, but by design also about your next of kin. I think this is really important. If you are thinking about publishing your genetic data please keep those issues in mind. And if you come to the conclusion that this isn’t for you, because you fear negative repercussions or just have a gut feeling of not really wanting to publish the data: Please don’t do it.

And what I also can’t stress enough is that openSNP is developed and run by a team of about four people and we are all doing this in our spare time as a fun project and as community service, without compensation. Some of us have day jobs, others are still studying and some even do both. So while we are doing our best to keep everything running it might sometimes take a while. But if you feel like contributing to the project please get in touch with us. We’d love to have more people in on this.

Product: openSNP
Website: www.opensnp.org
Price: Free

Author’s note: Data sharing, especially genetic data, is a very sensitive topic in our community. I want to fully disclose my bias towards openness and sharing. I believe that our kindergarten teachers had it right when they taught us that sharing is one of the fundamental human traits we should all cultivate. To this end, I have participated in openSNP and you can view my genetic data here and my Fitbit data here.

This is the 18th post in the “Toolmaker Talks” series. The QS blog features intrepid self-quantifiers and their stories: what did they do? how did they do it? and what have they learned?  In Toolmaker Talks we hear from QS enablers, those observing this QS activity and developing self-quantifying tools: what needs have they observed? what tools have they developed in response? and what have they learned from users’ experiences? If you are a “toolmaker” and want to participate in this series, contact Rajiv Mehta or Ernesto Ramirez.


Posted in Toolmaker Talks | 1 Comment

Numbers From Around the Web: Round 11

I’m typing this post while flying back to Southern California after spending a few days at a “Big Data” conference in San Francisco. One of the best things about the conference was meeting the subject of today’s round of Numbers From Around the Web. I first stumbled upon Bastian because he’s the main instigator and developer behind a great project called openSNP. Simply put, openSNP is a place you can host your direct-to-consumer genomic data for the world to see, understand, play with, download; well, you get the point. This is a really interesting phenomenon that deserves its own post, but we’re going to explore some really neat QS experimentation and learning Bastian engaged in to better understand his sleep.

Bastian, On Sleep

So, I found out about Bastian because openSNP announced that they also built a method to link and host Fitbit data (you can do that here if you’re so inclined). Turns out Bastian is an avid Fitbit user and has been using it to explore his sleeping patterns. His first major insight from his data indicated that in about 5.5 years he should be sleeping 24 hours per day:

So I downloaded my data from openSNP and started playing around with it: I did a simple linear regression over the time series and could indeed find a trend towards more sleep. The regression came out as y = 0.5x + 417, which more or less says that for each two days that pass I will sleep a minute longer, which also means that it will be about 2000 days (or 5.5 years) until I will sleep 24 hours a day.
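
As a quick check of the arithmetic in the quote, here is a minimal sketch. It assumes that y is nightly sleep in minutes and x is the number of days elapsed, which the quote implies but does not state outright; the variable names are purely illustrative.

```python
# Check the extrapolation from the quoted regression, y = 0.5x + 417.
# Assumption: y is sleep per night in minutes, x is days elapsed.
SLOPE = 0.5                # extra minutes of sleep per day
INTERCEPT = 417.0          # minutes of sleep at day 0
MINUTES_PER_DAY = 24 * 60  # 1440 minutes

# Solve SLOPE * x + INTERCEPT = MINUTES_PER_DAY for x.
days_until_full_day = (MINUTES_PER_DAY - INTERCEPT) / SLOPE
years_until_full_day = days_until_full_day / 365.25

print(f"{days_until_full_day:.0f} days, about {years_until_full_day:.1f} years")
```

Under those assumptions the line reaches 24 hours after roughly 2,046 days, or about 5.6 years, which agrees with the rounded figures in the quote.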

So yes, obviously regression may not be the best tool in the statistical toolbox to understand sleep, so he decided to examine another question: “Do I sleep better or worse when there is someone in bed with me?” Using his sleep and calendar data he was able to identify nights he spent alone and nights he slept next to a warm body and found some pretty interesting stuff.

Sleeping alone vs. sleeping with a companion

You can clearly see here in his table of 80 days of sleep (60 alone vs 20 with a companion) that he actually tends to sleep worse when he is sharing his bed. While he spends more time in bed, he takes longer to fall asleep, spends more time awake, and is awakened more often. For those of you who are not statistically inclined, those p-values indicate how likely it would be to see a difference at least this large between the two categories if chance alone were at work (you can learn more about p-values here).
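
To make the idea of a p-value more concrete, here is a minimal sketch of a permutation test for a difference in means. The post does not say which test Bastian used, so this is a swapped-in illustration and not his method; the function name and the sample values are hypothetical and invented for this example.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the labels: if there is no real difference, any split
        # of the pooled values into two groups is equally plausible.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical minutes-to-fall-asleep values, invented for illustration only.
alone = [8, 10, 7, 9, 11, 8, 10, 9]
with_companion = [14, 12, 15, 13]

p = permutation_p_value(alone, with_companion)
print(f"p = {p:.4f}")
```

A small p-value here means that random relabelling of the nights rarely produces a gap as large as the one observed, which is the sense in which the difference is unlikely to be due to chance alone.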

Like many good scientists he dug deeper to make sure what he was observing wasn’t related to other confounding variables such as the day of week:

weekdays vs. weekend days

Bastian didn’t find any significant differences in sleep quality between weekend and week days for his sleeping situation, but as one might expect he’s less active and sleeps more on weekends.

So, while this analysis might seem simplistic, one of the great things about Bastian and what he’s developing at openSNP is his willingness to be open with his data. Do you have some ideas about what you might find about Bastian from his Fitbit data? Have another hypothesis about sleep? Well you can test it out by downloading his data! You can start by reading his excellent post about this sleep analysis here.

Every few weeks be on the lookout for new posts profiling interesting individuals and their data. If you have an interesting story or link to share leave a comment or contact the author here.

Posted in Numbers from Around the Web | 3 Comments

Numbers From Around the Web: Round 9

Some people may be wondering how I find all the amazing people conducting neat self-tracking experiments and creating jaw-dropping personal data visualizations. Well, for the most part I just listen. I’m constantly paying attention to what’s being said on Twitter about #QuantifiedSelf. When that doesn’t work I just use the power of Google to find people who are blogging about self-tracking, self-experimentation, or personal data. It’s great to look through the search results and see how many people are sharing their personal stories and insights. While doing some searching this morning I stumbled across a project that immediately brought a smile to my face. Hopefully you’re as excited by this as I am.

Chris Volinsky is a statistician at AT&T Research and he’s no stranger to handling large data problems. Back in 2009 he was part of the team that won the $1 million Netflix Prize. He also has quite the impressive list of research papers that illustrate the many different uses of cellphone location data. But what is really interesting about Chris is his newest project: My Year of Data.

Back in November of 2011 Chris started off a blog entry with this:

My name is Chris. I am 40 years old. I am 5’9 1/2″ and weigh 174 pounds. I walked 9,048 steps and have consumed 1,406 calories today (so far).

Realizing that he’d been gaining weight and wasn’t at his optimal health, he decided to take a data-centric approach to improving his health. He is a statistician after all. So far, he’s found some interesting things. Take for instance his weight and dietary tracking.

As he explains in this post, Chris typically has a hard time tracking his diet consistently. This can be pretty frustrating when you hear about how important it is to eat this or not eat that to help with weight reduction. Rather than get frustrated Chris turned to the data to see what he could learn. When he stopped looking at the data he was entering and started looking at the missing data, an interesting trend leapt out. He found that fluctuations in his weight appeared to be correlated with whether or not he was logging food. Take for instance the plot below. It appears that there is a pretty clear association between periods of weight loss and periods of actively logging his food (pink zones). The opposite also appears to be true – no food logging = weight gain.

Weight chart with food tracking highlighted in pink
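
One simple way to look for this kind of pattern is to compare the average day-over-day weight change on days with and without a food log. The sketch below uses hypothetical records invented for illustration, not Chris's data, and the function name is an assumption; it is not the analysis from his post.

```python
from statistics import mean

# Hypothetical daily records, invented for illustration only:
# (weight in pounds, whether food was logged that day).
records = [
    (174.0, True), (173.6, True), (173.1, True), (172.9, True),
    (173.2, False), (173.8, False), (174.1, False),
    (173.7, True), (173.2, True),
]


def mean_daily_change(records):
    """Average day-over-day weight change, split by logging status."""
    changes = {True: [], False: []}
    # Pair each day with the previous one and file the change under
    # the logging status of the later day.
    for (prev_weight, _), (weight, logged) in zip(records, records[1:]):
        changes[logged].append(weight - prev_weight)
    return {logged: mean(vals) for logged, vals in changes.items() if vals}


result = mean_daily_change(records)
print(f"logging days:     {result[True]:+.2f} lb/day")
print(f"non-logging days: {result[False]:+.2f} lb/day")
```

A negative average on logging days next to a positive one on non-logging days would echo the pattern in the chart, though an association like this says nothing about which way the cause runs.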

So this is where a typical NFATW post would stop. We have an interesting finding and a neat data visualization. But, Chris is doing something much more interesting than just talking about his weight data. He is on a long-term self-tracking and self-discovery journey and he is trying to enlist other interested parties to help him. Chris is going the extra step and posting all of his self-tracking data online for anyone to analyze, visualize, or just get inspired.

You can access all of his amazing data via a public dropbox folder that he’s set up. He even has a nice README file explaining the datasets and formats. So far he’s sharing the following:

  • Fitbit: sleep and activity data
  • FitLinxx: weight training data from gym activities
  • Livestrong: dietary tracking data
  • Runkeeper: running and other exercise activity data
  • RescueTime: productivity tracking (computer/internet use)

All the data is open and available for you to play with. This should be a really interesting project to keep “track” of in the future (pun definitely intended). To help inspire some action on your part I took some time today and looked at Chris’s most recent available data to see what I could find out. I downloaded his Fitbit data and decided to look for any interesting patterns. Turns out that when taking a look at his daily patterns of activity there seems to be something going on on Thursdays that reduces his step count and activity time. Also, Saturday is by far the best day, with an average of 9,862.56 steps and 5.3 hours spent being active (data available here).

Mean steps per day

Mean activity minutes per day
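
A day-of-week summary like the ones charted above can be sketched in a few lines. The rows below are hypothetical values invented for illustration, not Chris's data, and the (date, steps) layout is an assumption, since the format of his Fitbit export is not described here.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical (date, steps) rows, invented for illustration only.
rows = [
    (date(2012, 1, 5), 6200),   # Thursday
    (date(2012, 1, 7), 10100),  # Saturday
    (date(2012, 1, 12), 5800),  # Thursday
    (date(2012, 1, 14), 9600),  # Saturday
    (date(2012, 1, 16), 8300),  # Monday
]


def mean_steps_by_weekday(rows):
    """Group step counts by day of the week and average each group."""
    by_day = defaultdict(list)
    for day, steps in rows:
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        by_day[WEEKDAYS[day.weekday()]].append(steps)
    return {name: mean(values) for name, values in by_day.items()}


averages = mean_steps_by_weekday(rows)
for name in WEEKDAYS:
    if name in averages:
        print(f"{name}: {averages[name]:.0f} steps")
```

The same grouping works for activity minutes or any other daily metric by swapping the second column.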

Make sure to reach out to Chris over at his blog and take a look at his data to see what interesting things you can figure out!

Posted in Numbers from Around the Web | 1 Comment