2011-09-22

Data sharing in the life sciences

What is the best way to create, share and maintain biological databases?

If nature has made any one thing less susceptible than all others of exclusive property, it is the action of the thinking power called an idea... -- Thomas Jefferson, 13 Aug. 1813

The life sciences these days are turning into a high-throughput data enterprise. Where in the old days you ran one agarose gel, advances in technology have spawned a massively parallel measurement extravaganza. Now you measure hundreds of compounds and proteins, thousands of expressed genes, and billions of base pairs. And you store the results as digital content.

It has been called a wave, an avalanche, an explosion, a deluge of data: it is huge and valuable. It promises to transform healthcare. And since it is digital, it can be copied at close to zero cost. In this it is more like an idea than a blood sample: if you share it, two people can now fully enjoy it. This raises interesting questions: who should have access to it? How can you protect privacy?

Who should have access to data?

Our society endorses the principle of private property. So the answer to this question has been: if you pay to create the data, you own it, and you decide who can access it. It does not matter that you can make a copy basically for free. Only the creator has this copy-right.

Songs are protected by copyright so that studios and artists can make money off them and will keep recording songs. The same holds for databases, and because copyright was invented for creative works like songs and books, the EU has created, on top of it, a special sui generis right that protects databases whose creation cost a significant amount of money. As a result, much scientific data gathers mold in the server catacombs of industry, of hospitals, of research institutes. Its owners see no reason to share access, even if they do not make use of it themselves. They paid for it, they may do with it as they damn well please. To some, that just seems wrong.

Because the owner decides about access, the question about who should have access quickly becomes a question about who pays for the databases.

For scientific data, there is the argument that by making the data freely available to all, anyone can try to generate insight from it, accelerating the whole enterprise of science and bettering the lot of mankind. Following this reasoning, biomedical data is a common good, and database creation and maintenance should be funded by the taxpayer. The resulting data can then be made freely available to all.

Fortunately, there are literally thousands of specialized biological databases (see the recent NAR database issue, which lists some 1300). Unfortunately, governments may find it difficult and costly to pick who should receive funding (some tangential but entertaining articles by Paul Graham on government and investment in startups are here and here).

Few databases should be maintained forever. They may reach the end of their utility, or never get used enough in the first place. But how do you measure the value a database provides? This has been identified as a major problem by a high-level expert panel in a recent report on data infrastructures to the EU.

One solution could be to focus direct funding on the main, central databases like UniProt or Ensembl, which everybody from small companies to big pharmaceutical firms to academic researchers uses regularly, and whose role as basic infrastructure for the life sciences is not in question.

For the many more specialized databases, the classic solution for putting a value on things has been known for hundreds of years: a market.

One should not fund smaller databases directly, but allow users to fund the ones they need via subscription. This would create a market where users vote on value by paying a price. It would be easy to make allowance in grant funding for database access, as is currently done for reagents and other consumables. The system would be self-regulating, in that unused databases would vanish and successful ones would be able to grow.

How can you protect privacy?

The second large issue about data access has to do with the protection of privacy. Do you want your genome data publicly accessible to your neighbor? Your kid's data to his classmates?

Most people wouldn't. And so it is very difficult to get legal access to clinical data.

However, we are learning in this new millennium that digital content is impossible to keep private: classified government files show up on Wikileaks. Songs by Justin Bieber are shared by teenage girls all around the world. This has huge implications for genome data. As soon as content becomes valuable to individuals, someone will crack it and upload it to some file share, and that is the end of control. The only reason something will not be stolen is that it is not interesting.

The other side of the coin is that it would be useful for medical research to have genomic data and electronic health records available for hundreds of thousands of people, so you could systematically search for correlations. Just imagine how much easier it would be to select case populations for clinical trials. But even if you wanted to, in many cases it would not even be possible to get access to your own electronic health record.

To address this issue, there are some initiatives, like the Personal Genome Project. Two days ago I met John Wilbanks of Creative Commons, who is starting an exciting project to do for consent forms what Creative Commons has done for licenses.

How can you find the data?

A critical issue with open data sharing is that with more and more data, it becomes difficult to identify data that is worth your time. It is like having hundreds of channels on TV: you need a guide, or you subscribe to a pay channel where you know the content is worth your time. In a world of data abundance, trust makes data valuable. It guarantees quality and provenance: clean, annotated and organized data, data that is comparable and searchable. iTunes shows us that the best, if not the only, way to sell digital content is ultimately to sell convenience.

We see the same effect with publications. In the last 10 years, the annual number of publications in PubMed has more than doubled. Finding all relevant papers on a given specialized subject in a reasonable amount of time is close to impossible. You have to resort to specialized databases to more quickly find relevant papers for your gene, mutation or disease of interest.
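As a hedged illustration of the scale of that search, the sketch below queries PubMed through NCBI's public E-utilities esearch endpoint. The search term and result limit are placeholders chosen for the example; sorting the returned hits for real relevance is exactly the work a curated, specialized database does for you.

```python
# Minimal sketch: keyword search against PubMed via NCBI E-utilities.
# A real tool would also page through results and fetch abstracts with efetch.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search(term, retmax=20):
    """Return a list of PubMed IDs matching the query term."""
    params = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    with urlopen(f"{ESEARCH}?{params}") as resp:
        tree = ET.parse(resp)
    return [e.text for e in tree.findall(".//IdList/Id")]

if __name__ == "__main__":
    # Illustrative query: papers mentioning a gene and a disease together.
    for pmid in pubmed_search("BRCA1 AND breast cancer", retmax=10):
        print(pmid)
```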

Already today, a major obstacle to data sharing is not access or the cost of access, it is the cost of integration. Many projects set out to create a multidimensional view of experimental data (proteomics, genomics, transcriptomics), but the integration often remains a dream: it is painfully hard to integrate data from different sources, let alone different kinds of data. There are a number of projects that aim to lower these barriers by establishing common standards and vocabularies, like BioSharing, BioPortal and MIBBI. There are other efforts to create open tools and standards, like the Society for Biocuration, BioDBcore (a standard for describing biological databases), and toolkits for curators such as BioMart or ISA-tools.
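A hedged, toy-sized sketch of why shared identifiers and vocabularies matter: the file names and columns below are invented for illustration, with one dataset keyed on gene symbols and the other on UniProt accessions. Without a curated mapping between the two identifier schemes (the kind of thing a resource like BioMart can export), the tables simply cannot be joined.

```python
# Toy illustration of the integration problem: two datasets describing the
# same genes but using different identifier schemes. All file names and
# column names here are invented for the example.
import pandas as pd

expression = pd.read_csv("expression.csv")   # columns: gene_symbol, log2_fold_change
proteomics = pd.read_csv("proteomics.csv")   # columns: uniprot_acc, abundance
id_map     = pd.read_csv("id_map.csv")       # columns: gene_symbol, uniprot_acc
                                             # (in practice exported from a
                                             # curated resource such as BioMart)

# Without id_map the two tables share no common key and cannot be joined;
# with it, integration reduces to two merges.
merged = (expression
          .merge(id_map, on="gene_symbol", how="inner")
          .merge(proteomics, on="uniprot_acc", how="inner"))

print(merged.head())
```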
