Key Takeaways

  • Investigators have developed a map that identifies non-coding regions of the human genome that lack typical variation, indicating that they are important sequences conserved during evolution and natural selection
  • The work will help scientists study genomic regions that, when mutated, may cause disease

BOSTON – Every human’s genome has millions of genetic variants, but most have little to no effect, making it difficult for clinicians to make medical diagnoses based on genetic differences.

Using patterns of variation from tens of thousands of individuals with whole-genome sequence data, a team led by investigators at Massachusetts General Hospital (MGH) and the Broad Institute of MIT and Harvard recently identified regions of the genome that lack typical variation, indicating that they are important sequences conserved during evolution and natural selection.

The authors of the study, which is published in Nature, note that when a variant arises in one of these regions, it’s more likely to have an effect on an individual’s health.

“We sought to examine how natural selection shapes patterns of human genetic variation across the whole genome, especially in the non-coding genome, which has been much less characterized than protein-coding regions,” says senior author Konrad Karczewski, PhD, an Assistant Professor in the Analytic and Translational Genetics Unit in the Department of Medicine at MGH and Associate Member of the Broad Institute of MIT and Harvard.

“While our previous work evaluated the 2% of the genome that encodes genes, our new metrics extend to the entire genome, greatly expanding our knowledge about which functional genomic elements likely harbor variation with potential clinical significance.”

Karczewski and his colleagues aggregated and processed information from 76,156 human genomes into the Genome Aggregation Database (gnomAD), a large international human genome reference resource that they have been expanding and releasing to the public continuously.

The variants in this database have been helping clinical labs worldwide perform diagnoses of rare diseases, and this release greatly expands the ability to do so in non-coding regions.

The team used the results to build a “genomic constraint map” for the whole genome (called Gnocchi, for Genomic NOn-Coding Constraint of HaploInsufficient variation). The map indicates which regions of the genome are “constrained,” meaning that variants arising in those regions are often damaging enough that natural selection removes them from the population.
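As a rough illustration of the idea of constraint, the sketch below scores genomic windows by how far the observed number of variants falls below an expected number. It is a simplified example under stated assumptions, not the study's actual method: the window names, the counts, and the `constraint_z` helper are all hypothetical, and the expected counts are simply given rather than derived from a mutation model.

```python
import math


def constraint_z(observed: int, expected: float) -> float:
    """Depletion score: positive when fewer variants are seen than expected.

    Hypothetical helper for illustration; assumes the expected count is
    supplied by some upstream model that is not shown here.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (expected - observed) / math.sqrt(expected)


# Made-up windows: (name, observed variant count, expected variant count)
windows = [
    ("window_a", 60, 100.0),   # strongly depleted of variation
    ("window_b", 98, 100.0),   # close to expectation
    ("window_c", 130, 100.0),  # more variation than expected
]

scores = {name: constraint_z(obs, exp) for name, obs, exp in windows}

# Rank windows from most to least constrained (highest score first).
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.2f}")
```

In this toy example, the window with the largest shortfall of variants relative to expectation ranks as the most constrained, which mirrors the intuition that damaging variants in such regions tend to be removed by selection.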

The team found that constrained regions are enriched for regulatory elements (which control gene expression) and variants implicated in complex human diseases and traits.

The scientists also found that more constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that studying non-coding constraint can aid in the identification of constrained genes.

“We anticipate that Gnocchi could be used to prioritize genetic variation discovered in non-coding regions of the genome in patients with rare diseases, which can potentially provide clues for genetic causes of diseases and starting points for targeted therapeutics,” explains Karczewski.

Next, it will be important to add genomic information from more individuals to this newly developed dataset.

“Future efforts towards a larger, more diverse human reference dataset would further improve rare disease diagnoses for all, and create better powered constraint metrics, giving us a better understanding of the distribution and effects of human genetic variation,” says Karczewski.


Additional authors include Siwei Chen, Laurent C. Francioli, Julia K. Goodrich, Ryan L. Collins, Masahiro Kanai, Qingbo Wang, Jessica Alföldi, Nicholas A. Watts, Christopher Vittal, Laura D. Gauthier, Timothy Poterba, Michael W. Wilson, Yekaterina Tarasova, William Phu, Riley Grant, Mary T. Yohannes, Zan Koenig, Yossi Farjoun, Eric Banks, Stacey Donnelly, Stacey Gabriel, Namrata Gupta, Steven Ferriera, Charlotte Tolonen, Sam Novod, Louis Bergelson, David Roazen, Valentin Ruano-Rubio, Miguel Covarrubias, Christopher Llanwarne, Nikelle Petrillo, Gordon Wade, Thibault Jeandet, Ruchi Munshi, Kathleen Tibbetts, Genome Aggregation Database Consortium, Anne O’Donnell-Luria, Matthew Solomonson, Cotton Seed, Alicia R. Martin, Michael E. Talkowski, Heidi L. Rehm, Mark J. Daly, Grace Tiao, Benjamin M. Neale, and Daniel G. MacArthur.


Development of the Genome Aggregation Database was supported by the National Institute of Diabetes and Digestive and Kidney Diseases and the National Human Genome Research Institute of the National Institutes of Health.



About the Massachusetts General Hospital

Massachusetts General Hospital, founded in 1811, is the original and largest teaching hospital of Harvard Medical School. The Mass General Research Institute conducts the largest hospital-based research program in the nation, with annual research operations of more than $1 billion, and comprises more than 9,500 researchers working across more than 30 institutes, centers and departments. In July 2022, Mass General was named #8 in the U.S. News & World Report list of "America’s Best Hospitals." MGH is a founding member of the Mass General Brigham healthcare system.