Toolmaker Talks: Bastian Greshake (openSNP)
October 30, 2012
We talk very frequently here on the QS website about tools, methods, and systems that help us understand ourselves. When it comes to the self, there may be nothing more fundamental to understanding our objective ourness than our basic genetic makeup. Many of you have probably undergone, or have thought about using, Direct-To-Consumer genetic testing to better understand your phenotypes, disease risk, or even your ancestry. That’s all great, and I’ve spent a lot of valuable time combing through my own genetic data, but as with most data, the true power lies in large datasets that provide observations across many individuals. So how do you participate in that type of sharing and learning? Enter the team at openSNP.org. Today we talk with Bastian Greshake, one of the developers behind the openSNP project.
How do you describe openSNP? What is it?
The too long, didn’t read version: An open platform that allows people to share their genetic information and traits that are suspected to be at least partially genetically predisposed, and that also tries to annotate those genetic variants with primary scientific literature. The data can be exported from openSNP through the website or through APIs, making it easy to re-use the data.
A longer version: openSNP has basically two target groups and users may as well fit in both categories.
First there are customers of Direct-To-Consumer (DTC) genetic testing like 23andMe who want to share their genetic information with the public for various reasons. They can use openSNP to release their genetic data into the public domain using the Creative Commons license which is applied to the data uploaded and entered in openSNP.
As genetic information on its own is interesting but not very useful for analyzing the effect of genetic variants on bodily traits, those users can also enter information about traits which might be genetically influenced, and create new categories which all other users can then fill in. Those traits range from the more obvious ones, like eye and hair color, to more exotic ones like political ideology. A few weeks ago we also created a way for users to connect their Fitbit accounts to openSNP to make the collection of data easier and more standardized. The genetic effects on activity, sleep habits and weight loss/gain can more easily be analyzed in this fashion.
We also mine the databases of Mendeley, the Public Library of Science and the SNPedia to annotate the genetic variants users carry. This allows customers of DTC testing to find out what the recent scientific literature is able to tell them about their genetic variants. While the SNPedia is a crowd-curated Wiki, Mendeley and the Public Library of Science link back to primary literature, in the latter case even to Open Access literature whose full text is available to everyone.
The second group of users who are interested in openSNP are scientists and citizen scientists who are interested in using the data for their own studies, be it to figure out what genetics can tell us about our ancestry or which effects single variants have on disease risks or other traits. The data can be downloaded from openSNP in bulk or more granularly accessed through a JSON-API and the Distributed Annotation System, a standard in Bioinformatics, which for example is used to visualize the data.
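As a rough illustration of what working with such downloads might look like, here is a minimal Python sketch. It assumes the files follow the tab-separated layout used by 23andMe raw data exports (comment lines starting with "#", then rsid, chromosome, position and genotype); other file formats would need their own parser. The function names and the sample records are made up for illustration and are not part of openSNP or its APIs.

```python
from collections import Counter


def parse_genotypes(text):
    """Parse 23andMe-style raw data text into a {rsid: genotype} dict.

    Assumed layout: tab-separated columns rsid, chromosome, position,
    genotype, with comment lines starting with '#'.
    """
    genotypes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows instead of failing
        rsid, _chromosome, _position, genotype = fields[:4]
        genotypes[rsid] = genotype
    return genotypes


def genotype_counts(datasets, rsid):
    """Count how often each genotype occurs at one variant across data sets."""
    return Counter(d[rsid] for d in datasets if rsid in d)


# Placeholder records (invented identifiers, not real user data).
SAMPLE_A = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tCT\n"
)
SAMPLE_B = (
    "rs0000001\t1\t1000\tAG\n"
    "rs0000002\t1\t2000\tCT\n"
)

if __name__ == "__main__":
    datasets = [parse_genotypes(SAMPLE_A), parse_genotypes(SAMPLE_B)]
    print(genotype_counts(datasets, "rs0000002"))
```

With more data sets, the same counting step could be repeated per trait group to compare genotype frequencies between, say, people with different eye colors.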
Both groups can profit from the commenting features, which allow users to communicate about traits and individual genetic variants. The internal message system of openSNP also facilitates further communication, for example to share details about shared traits and diseases or to allow people who want to use the data to get back in touch with the people who uploaded the data. The latter enables direct, bidirectional exchange between those two user groups: Researchers can ask questions about traits, and people who have shared their data have a back channel as well and can get notified about the results researchers have obtained.
What’s the backstory? What led to it?
It more or less began with me getting my genetic information analyzed by 23andMe myself. After I received the results I published the data in a git repository on GitHub to make it available for others who might benefit from having more data. As I started to dig deeper into my own results and the raw data I wanted to have more data sets myself, to be able to compare the results. But unfortunately there wasn’t a single resource for such data. Some people had also published their data on GitHub, others on their own websites, and others collected publicly available data sets in a Google Spreadsheet or participated in projects like the Personal Genome Project.
This was quite frustrating: Finding the data was hard and most often there was no additional data about traits attached. And more often than one would expect there was also no way to contact the people who made the data public. So the idea to create a platform to solve this problem grew and I contacted some friends to see if they were interested in building such a platform, just for fun. We started out with the basic idea of creating a platform where people could upload their genetic data along with some traits they have. A couple of weeks after we started to work on the project we stumbled upon the APIs of Mendeley & the Public Library of Science and thought it might be cool to include additional data about the genetic variants as well. During the development we came up with more and more features, like the openSNP APIs. All in all the project is still growing and we’re working on adding and refining features.
What impact has it had? What have you heard from users?
We submitted the first release of openSNP to the 2011 PLOS/Mendeley Binary Battle, a competition looking for creative ways to use their APIs, and won the first prize. We also secured a small grant from the German Wikimedia Foundation, which allowed us to genotype over 20 people, mainly from underrepresented groups, to diversify the available data. Those persons have now released their genetic data on openSNP as well. Right now we have over 250 genetic data sets on openSNP and just short of 600 registered users. Those numbers don’t sound too impressive in the age of one billion people on Facebook. But to put it into perspective: Genetic testing is still a niche thing and before openSNP was released there were about 40-50 of those data sets publicly available.
The feedback from our users has been very positive. Many users come up with new ideas for features they would like to see added and we are really open to those suggestions and critiques. Many of the API methods which are now implemented (and the whole Distributed Annotation System) are only in place because users let us know they wanted them. I know of users who are actively using openSNP to learn more about their test results and are in an active exchange with other users with similar traits. And while the amount of data we have so far doesn’t really allow scientifically sound studies, there are already people using the data. For example, there are users who run their self-written analysis tools over the openSNP data sets and report the results back to the users, which is amazing.
What makes it different, sets it apart?
Of course we’re not really the first to think of such an idea but are more or less a remix. For example 23andMe themselves do use the data of consenting customers for studies. They also provide questionnaires about traits which users can take. But this data isn’t available to the public, due to (perfectly reasonable) concerns in terms of privacy, bio-ethics and liability. On the other hand there are projects like the Personal Genome Project, which publishes traits and genetic data of participants into the public domain. But for reasons similar to those of 23andMe, participation in the project isn’t open to everyone.
We feel that informed individuals should be in the position to share their data with the world in an easy fashion, like they are already doing on their own websites. And of course we’re targeting a slightly different group: Probably over 150,000 people are customers of some DTC genetic testing. This is a huge potential data source which could be used to help us understand new and exciting things.
What are you doing next? How do you see openSNP evolving?
We’re still developing and refining openSNP. One of the biggest problems right now is the quality of the data for the additional traits. We have kept the process of adding data really open on purpose, to make it easy for people to provide additional information about themselves. Unfortunately this has the side-effect that the quality of the descriptions varies wildly. Those problems start off with regional idiosyncrasies: Is it “Eye Color” or “Eye Colour” and are you using the metric or the imperial system of units? And is your eye color blue or “Indeterminate brown-green with a subtle grey caste”? This granular data can be very useful, but for many applications it can be too specific. With the implementation of the Fitbit API we’ve taken a first step to keep the entering of data simple but unified at the same time. And we’re currently looking into other ways to counter problems like this.
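To make the problem concrete, here is a minimal Python sketch of the kind of clean-up someone re-using the trait data might apply on their own side. It is not how openSNP handles traits; the spelling table, the function names and the choice of centimeters as the common unit are all assumptions made for illustration.

```python
import re

# Hypothetical lookup of regional spellings; a real table would be much longer.
SPELLING_VARIANTS = {"colour": "color"}

CM_PER_INCH = 2.54


def normalize_trait_name(name):
    """Lower-case a trait name, collapse whitespace and unify spellings."""
    words = re.sub(r"\s+", " ", name.strip().lower()).split(" ")
    return " ".join(SPELLING_VARIANTS.get(word, word) for word in words)


def height_to_cm(value, unit):
    """Convert a height given in centimeters or inches to centimeters."""
    unit = unit.strip().lower()
    if unit in ("cm", "centimeter", "centimeters"):
        return float(value)
    if unit in ("in", "inch", "inches"):
        return float(value) * CM_PER_INCH
    raise ValueError(f"unknown height unit: {unit!r}")


if __name__ == "__main__":
    # Both spellings end up under the same key.
    print(normalize_trait_name("Eye Colour") == normalize_trait_name("eye  color"))
    print(height_to_cm(70, "inches"))
```

Free-text values such as a very detailed eye color description are harder: mapping them onto a small set of categories loses information, so a sketch like this would only cover the trait names and units, not the values themselves.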
We’re also looking into more data sources to annotate the genetic variants listed in openSNP, to provide even more information for customers of DTC testing. And we’re also working on making our APIs more powerful. With the rOpenSci package there is already a great library which makes use of the APIs in their current state, but of course we would like to see more of those libraries.
And it’s hard to say in which direction openSNP will evolve as we are a bit dependent on the DTC genetic testing industry. More and more data, like Whole-Genome or Exome Sequencing, is generated and we are working on reflecting those changes on openSNP as well. And we’re open to any suggestions. So if you find that a feature is missing you should let us know, and we will try to work out how it might be usefully implemented.
Anything else you’d like to say?
First of all: We know genetic information is sensitive, and depending on where you are living there might not even be laws to protect you from discrimination based on your genes. Other countries, like the US with the Genetic Information Nondiscrimination Act (GINA), have some mechanisms against this, but even those might not offer total protection in the end. And you should also keep in mind that your genetic information does not only give away details about yourself, but by design also about your next of kin. I think this is really important. If you are thinking about publishing your genetic data please keep those issues in mind. And if you come to the conclusion that this isn’t for you as you have to fear negative repercussions or just have a gut feeling of not really wanting to publish the data: Please don’t do it.
And what I also can’t stress enough is that openSNP is developed and run by a team of about four people and we are all doing this in our spare time as a fun project and as community service, without compensation. Some of us have day jobs, others are still studying and some even do both. So while we are doing our best to keep everything running it might sometimes take a while. But if you feel like contributing to the project please get in touch with us. We’d love to have more people in on this.
Author’s note: Data sharing, especially genetic data, is a very sensitive topic in our community. I want to fully disclose my bias towards openness and sharing. I believe that our kindergarten teachers had it right when they taught us that sharing is one of the fundamental human traits we should all cultivate. To this end, I have participated in openSNP and you can view my genetic data here and my Fitbit data here.
This is the 18th post in the “Toolmaker Talks” series. The QS blog features intrepid self-quantifiers and their stories: what did they do? how did they do it? and what have they learned? In Toolmaker Talks we hear from QS enablers, those observing this QS activity and developing self-quantifying tools: what needs have they observed? what tools have they developed in response? and what have they learned from users’ experiences? If you are a “toolmaker” and want to participate in this series, contact Rajiv Mehta or Ernesto Ramirez.