Q&A with Daniel MacArthur

The Broad Institute and Massachusetts General Hospital genomicist talks about collecting, curating, sharing, and understanding genomes.

By Misha Angrist

In the Genome Aggregation Database (gnomAD), you and your colleagues have amassed exomes (DNA sequenced from each of the roughly 30,000 genes in the human genome) from more than 123,000 people and whole genomes (all the DNA from genes plus everything else) from more than 15,000 people. How did this massive undertaking come to be?

When I started my lab almost six years ago, one of our first projects involved diagnosing patients with rare muscle diseases. Most had gone without a diagnosis for five or 10 years. We started doing exome sequencing, and one of the things that became clear was that in order to make sense of the genetic variation we were discovering in these patients, we needed to look at variation in the context of the general population. Had the DNA variants we were discovering ever been seen in healthy people? How common were they? Were they associated with these particular muscle diseases? We needed large numbers of healthy people for comparison. At the time, there were a couple of larger genome projects available with a few thousand exomes altogether. Those were useful, but not useful enough. The data was quite old, and it wasn’t consistent — it wasn’t clean enough. And there weren’t enough exomes to determine whether these rare variants actually caused disease. We needed a lot more.

We were lucky because four things came together at about the same time. First, people began to recognize that there was a clear need for a large repository of exomes and genomes. Second, the Broad Institute had been doing a lot of sequencing, so we already had data from tens of thousands of exomes. Third, we had a team that had developed a computational method for calling variants (“Is that a C or a T?”). And finally, we had a set of investigators who were really happy to share their data.


Did it come together fairly easily?

[laughs] No! It failed horribly at first; a lot of things that I won’t bore you with went terribly wrong. And then, after about 18 months, the clouds parted, and we suddenly had nearly 61,000 exomes. We released that in October 2014, and that was the Exome Aggregation Consortium (ExAC). Suddenly, anyone could have access to a large reference database of normal genetic variation.

Over the course of the next year or so we learned how to find regions of genes that lack variation — if genes can’t tolerate change, then that suggests they’re doing something really important as they are. And that’s turned out to be invaluable for looking for variation related to disease and understanding mutation rates. Now, with gnomAD, we have twice as many exomes, plus thousands of genomes; we can apply the same lessons but at much higher resolution.


Let’s say someone identifies a genetic variant that’s also present in a few gnomAD samples and thinks it might cause or contribute to disease. Is it possible to track down that person or even his or her medical record?

This is the most common class of question we get. The answer is that in most cases we can’t recontact them. We don’t have access to clinical records. For one quarter to one third of cases, we can go back and get more phenotype data.

That said, even though we don’t know their phenotype for sure, we can still make some pretty confident inferences. If someone finds a variant in a rare-disease patient and he or she looks for it in gnomAD and finds it in, say, 20 controls, then that’s pretty definitive evidence that it does not cause disease. Even if it’s present in just five healthy people you can pretty much rule it out.


I’m wondering if you can imagine a day when gnomAD houses millions of genomes?

I’m agnostic about whether gnomAD is the vehicle for moving this forward. Nothing lasts forever. In our next release we’re going to have more than 60,000 whole genomes and over 250,000 exomes. There’s a lot of super-cool stuff you can do once you start to get to those numbers. But the All of Us Research Program is, we hope, going to have a million genomes, so that will probably supersede gnomAD. And the UK Biobank has 500,000 people in it and has announced that it will be doing exomes and genomes on all of them. Plus, the UK Biobank will have the advantage of being able to link the sequence data with the phenotype data. Who knows what else will happen in the next three to five years?

The most important lesson I’ve learned is that if we’re going to do this properly, then we have to think carefully about the consent for data-sharing. Access to data is essential for the future of genomics. We have to stay on that and not let it slide. We have to be sure we can release aggregate data from large cohorts to the public; allow people to share their data with anyone they choose; and have a way to recontact patients and research participants.