Genome in a Bottle

Reference materials help to ensure accuracy in clinical genome tests.

By Kendall K. Morgan
Illustration by Alec Doherty

You don’t have to spend much time with the National Institute of Standards and Technology’s catalog of Standard Reference Materials (SRMs) to realize it’s a pretty unusual place to go online shopping. There’s the $867 jar of peanut butter (word on the street is that it isn’t even particularly tasty). For a minimum of $721, you can order up some human urine laden with contaminants of various kinds, including nicotine and arsenic. (Of course, you’ll have to decide whether you want frozen or freeze-dried.) Fifty grams of New York/New Jersey waterway sediment now goes for $778. And, as of last year, the SRM catalog officially entered the genome era, offering a collection of extremely well characterized human “genomes in bottles” (European, Ashkenazi Jewish, or Chinese) at a cost of $451 each. It’s $1,382 for a trio — three vials containing the DNA of two elderly Ashkenazi Jewish parents and their adult son.

Marc Salit, who has worked at the National Institute of Standards and Technology (NIST) for more than 25 years and now leads its Genome-Scale Measurements Group, calls the collection of genomes the “most audacious reference-material project ever undertaken. We’re trying to deliver tubes of DNA where we are asserting that we understand an unprecedented number of characteristics in a given tube — the magnitude is beyond anything we’ve ever thought of.”

If you’re curious why SRMs are so valuable and costly, the reason isn’t really about what’s inside the containers. They’re expensive because of the effort that’s gone into characterizing their contents so thoroughly and precisely. SRMs have been in use since the early 1900s as gold standards to ensure laboratory instruments are working properly, buildings are safe, and food labels are accurate. Now, they can also be used to evaluate and ensure the accuracy and judicious interpretation of genome test results, which is becoming increasingly important as whole-genome and whole-exome sequencing make their way from research laboratories to clinical settings and even become part of the All of Us Research Program (formerly the Precision Medicine Initiative). Genetic reference materials are also useful for the development and testing of new and improved genome sequencing technologies and computational algorithms that crunch the data. In some sense, the new SRM collection is yet another indication that clinical genome testing has arrived.

Check Your Work

The development of genomic SRMs might have been ambitious, but their primary aim is to make it possible to answer a pretty basic question: So you sequenced my genome. How well did you do? How do you know you got it right? When you consider that the human genome contains about 3 billion bases inherited from each biological parent, including stretches that can be riddled with repetitive sequences or rearranged in complex ways, ensuring you’ve gotten it right, and knowing when you haven’t, isn’t so easy, even if you can now buy a whole-genome sequence for as little as $1,000.

“It turns out sequencing a genome is actually a pretty hard thing to do comprehensively because of the nature of the technology [used] to characterize a genome and the nature of the genome,” Salit explains. “We’re really good at most of the genome, and we’re not very good at the rest of it. We’re really good at small variations in the genome and understanding accurately the sequences of little pieces of the genome. But we are less good at understanding the big picture and where things actually exist in any given genome. So we’re trying to put out a handful of genomes that are really well characterized so people can use those to develop new methods, to understand how well their current methods are working, and so that the whole field can get better at the things we’re not very good at yet.”

The standardized human genomes are the work of the NIST, together with partners at the Genome in a Bottle (GIAB) Consortium, including members of the federal government, academia, and industry. The effort, aimed at providing laboratories with the tools to advance clinical applications of whole genome sequencing and the Food and Drug Administration (FDA) with the ability to evaluate and regulate them, began with a series of conversations among members of the genomics community.

By 2012, says Justin Zook, a researcher at the NIST, it was clear that the interest and need were there. Zook explains that “reference genome” still means different things to different people. The first reference human genome assembly was put together in 2000. That DNA sequence data has been built and rebuilt 38 times over the years, pieced together from the genomes of multiple individuals, although much of it came from an African-American man known as RP11. It’s still in use, and, in fact, Zook and his colleagues rely on that reference data when analyzing their newly available genomes. What sets this newer effort apart is that, in addition to offering new genome data gathered using more than a dozen different available genome technologies, it makes the actual DNA available too — the physical tubes of the stuff — for others to analyze and re-analyze to see how well they’re doing. Cell lines containing those genomes are publicly available as well, opening up many more possibilities.

The Consortium issued the first pilot genome in a bottle in 2015. It came from a Caucasian individual from Utah, whose genome had already been studied extensively. For the next genomes, GIAB members selected two families — one Chinese and one Ashkenazi Jewish — who are participants in the Personal Genome Project (PGP), which George Church initiated at Harvard in 2005. (Only one family member, the son of the Chinese trio, is currently available in the NIST catalog, as RM8393, although GIAB characterized his parents.) The PGP was an ideal partner, Zook and Salit explain, because its participants understand that confidentiality and privacy can’t be guaranteed when DNA data and whole genomes are made publicly available. The PGP also permits commercial innovation of new products based on these genomes. The project therefore requires its participants to demonstrate a thorough understanding of genomes and genomic information and to waive any expectations of privacy.

“Because we’re actually distributing DNA in a little tube — really a genome in a bottle — we wanted to use genomes of individuals who had consented in a very broad and very open manner,” Salit says. “The PGP uses the state-of-the-art in consent in that it’s really open and rigorous and transparent and allows the genome of a consented individual to be used and shared without restriction.”

Gold Standards

Leon Peshkin, a genome researcher and lecturer at Harvard’s Systems Biology Department, known in some circles by his PGP ID huAA53E0, is the donor behind one of three human genome-in-a-bottle kits listed in the NIST catalog. His genome is available for purchase on its own or in combination with the genomes of both of his parents.

“My genome is definitely the best characterized public genome on the planet among human and any species’ genomes,” Peshkin says. “Well,” he adds, “there might be some private project we are unaware of, but other than that, the quality of my genome is likely at least 10 times, maybe 100 times, higher than the next best-characterized genome.”

When the PGP launched, Peshkin says he signed up enthusiastically. When the NIST later came to Church looking to decide who should become a genome in a bottle, Peshkin again didn’t hesitate. (It’s possible that the NIST didn’t even really need to ask, given the nature of the PGP waiver, but Peshkin says it did anyway.) While his genome is sold anonymously, Peshkin doesn’t mind outing himself. As a scientist, he knows just how little can be interpreted from a person’s genome today (with few exceptions), and how far the technology and the science still have to go. For him, the PGP and GIAB efforts were opportunities “to lead by example.” He believes that only by making millions of genomes and the corresponding medical and physiological information available can we eventually decipher what genomes mean.

It was also clear the Consortium would benefit most from trios, because the DNA shared among family members is useful for checking errors, and, more importantly, allowing researchers to see what was inherited from which parent and how individual variants interact when they get mixed in a child. There weren’t many trios to choose from. “It was a no-brainer,” Peshkin says.
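The error-checking value of a trio can be illustrated with a short sketch. The Python below is a simplified example, not GIAB's actual pipeline: it assumes each genotype is a pair of allele letters at a site on a non-sex chromosome, and the function name and sample calls are hypothetical. It flags any site where the child's genotype could not have come from one allele of each parent, which points to either a sequencing error or a rare new mutation.

```python
from itertools import product

def is_mendelian_consistent(child, mother, father):
    """Return True if the child's two alleles can be explained by
    one allele inherited from the mother and one from the father."""
    child_sorted = sorted(child)
    return any(sorted((m, f)) == child_sorted
               for m, f in product(mother, father))

# Hypothetical genotype calls at three sites (illustrative, not real GIAB data).
sites = {
    "site_1": {"child": ("A", "G"), "mother": ("A", "A"), "father": ("G", "G")},
    "site_2": {"child": ("C", "C"), "mother": ("C", "T"), "father": ("C", "C")},
    "site_3": {"child": ("T", "T"), "mother": ("C", "C"), "father": ("C", "T")},
}

# Sites that break the inheritance rule deserve a second look.
suspect = [name for name, g in sites.items()
           if not is_mendelian_consistent(g["child"], g["mother"], g["father"])]
print(suspect)  # ['site_3']
```

At site_3 the mother carries no T allele, so a T/T call in the child cannot be inherited and is flagged for review.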

After his genome became available, Peshkin at first paid a lot of attention to the ways it was being put to use. Ultimately, though, it wasn’t as exciting as it might sound. “They weren’t figuring out how long I’d live or what skills I have,” he says. “Nothing of that sort is happening now or [probably] will be for another 200 years.” He says that only after millions share their genomes will we get to something even remotely similar to such interpretations.

At conferences he attended, the talk instead revolved around which parts of the genomes could be sequenced and resolved with high confidence and how one technology performed compared with another. Eight companies use 12 technologies in GIAB.

Those technical details might not be captivating on their own, but that doesn’t make them any less essential. Studies of the five genomes have produced insights, like the fact that more than 39,000 places in the genome are called incorrectly by at least one sequencing technology. (One study reported that only 75 to 80 percent of bases in the coding region of the genome are found in regions that can be called with high confidence.) Many of those mistakes can be found in public databases, and plenty of them occur in disease-related genes. There’s still ample room for improvement.

“It’s been really important for the field that NIST has taken genomics seriously,” says Euan Ashley, a professor of medicine at Stanford University Medical Center, where he has collaborated on the GIAB effort. “It’s important because these are people who understand how to measure things well, and that’s what we’re focused on in trying to translate this incredibly revolutionary technology from the research world into the clinic.

“If you’re interested in discovering a variant, and that’s the reason you are exome sequencing a group of people, then whether you succeed or fail is defined by whether you discover something new, and as long as you did, it doesn’t matter too much if there are a lot of other things you might have discovered if your technology were better,” he continues. “In the [clinical] world I live in, if you are doing a single test on a single patient and there is a specific number of genes you need to assay in exquisite detail, then missing 10 to 15 percent of those genes because [certain regions are] not covered well is a problem. It’s quite literally a question of whether you make the diagnosis or not in that individual. It could even be life or death.”

The GIAB genomes offer a kind of “ground truth” in genomics that had been lacking before, Ashley says. Stephen Kingsmore, president and CEO at the Rady Children’s Institute of Genomic Medicine in San Diego, has been putting the GIAB materials to work in his efforts to use rapid whole genome sequencing to diagnose children born with mysterious genetic conditions.

“We need to understand whether we have the right type of accuracy and sensitivity and we need a gold standard for that,” Kingsmore says. “We don’t want to miss a diagnosis because we’re insensitive or make a mistake and diagnose a condition that’s not actually there because we’re not accurate enough.”

Kingsmore and his colleagues bring the GIAB samples back out each and every time they change methods. Before a new machine can be put into operation, it needs to be optimized. That usually means they run a GIAB sample over and over again to make sure they have reproducible accuracy and sensitivity.
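As a rough illustration of what such a check measures, the sketch below compares a lab's variant calls with a gold-standard set and reports sensitivity (the share of true variants that were found) and precision (the share of calls that were correct). It is a minimal example under stated assumptions: variants are represented as (chromosome, position, reference, alternate) tuples, the data are made up, and the function name is hypothetical. Real benchmarking tools also restrict the comparison to high-confidence regions and reconcile different ways of writing the same variant, which this sketch does not attempt.

```python
def benchmark(called, truth):
    """Compare a set of variant calls with a truth set.

    Both arguments are sets of (chromosome, position, ref, alt) tuples.
    Returns (sensitivity, precision) as floats.
    """
    true_pos = len(called & truth)    # real variants the lab found
    false_pos = len(called - truth)   # calls that are not in the truth set
    false_neg = len(truth - called)   # real variants the lab missed
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    precision = true_pos / (true_pos + false_pos) if called else float("nan")
    return sensitivity, precision

# Made-up example data (not real GIAB variants).
truth = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2500, "C", "T"),
    ("chr2", 400, "G", "A"),
    ("chr3", 9100, "T", "C"),
}
called = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2500, "C", "T"),
    ("chr2", 400, "G", "A"),
    ("chr2", 777, "A", "C"),  # a call that is absent from the truth set
}

sensitivity, precision = benchmark(called, truth)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

Here the lab found three of four true variants and made one incorrect call, so both figures come out to 0.75. Rerunning the same reference sample after each change in method shows whether those numbers hold steady.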

As for Peshkin, he doesn’t worry about having his DNA and genomic information out there and available for sale on the internet. For one, it’s too important to him to see the science advance. And, besides, he says, there’s already so much information about all of us out there that’s more easily interpretable than our genomes and is more readily usable to discriminate.

“If I wanted to be paranoid, I should be paranoid not about my genome but about [many] other things,” he says. “Like what I look like, how long my nose is, or what I say in this interview. … I’m an Ashkenazi Jew, who grew up in Russia. I know what discrimination means.”