Under the Skin: Big Data at Work
Since the futurist Alvin Toffler first warned the world of the looming “future shock” of information overload nearly five decades ago, scientists have been working overtime to meet the challenges and seize the opportunities of the digital age.
It’s no wonder. The payoff for successfully harnessing the power of big data can be huge. From national security experts teasing out the early signs of a potential threat to business leaders trying to beat their competitors to an emerging market, the stakes are high.
But if you’re really up for a challenge, take a fantastic voyage into the world of bioinformatics – one of the hottest fields applying big data analytics to real-world problems.
Researchers in bioinformatics develop tools to help make sense of the vast, complex and diverse data sets generated by studies in biological and medical science. Much of that data takes the form of strings of DNA that contain all of an organism’s genetic information.
In humans, that amounts to about 3 billion base pairs built from four compounds, adenine, guanine, cytosine and thymine, which form the rungs on DNA’s twisted ladder. The sequence of these bases provides the instructions needed for an organism to develop, survive and reproduce.
In his lab on Centennial Campus, plant pathologist David Bird leads the university’s new bioinformatics cluster, bringing together top researchers in genetics, statistics, computer science and biology. Meeting the challenges of big data has sparked “a philosophical change in the way we do science,” Bird says.
For one thing, researchers now have the computing power and the statistical tools to take on tasks that would have been impossible a decade ago, like subjecting a string of DNA to 10 billion tests. DNA sequencing gives researchers unprecedented insight into the inner workings of the genome, explaining how proteins are made, identifying which mutations are linked to cancer risks, or showing how parasites interact with their hosts.
In Our Genes
“Some people use the analogy of looking for a needle in a haystack,” says statistician Fred Wright. “But that’s not what we’re doing. We’re actually looking for lots of needles in many, many haystacks.”
Wright, a member of NC State’s new bioinformatics faculty cluster and director of the Bioinformatics Research Center, is studying genetic variations in people with cystic fibrosis, an inherited disorder that causes severe damage to the lungs and digestive system.
“Even with modern medicine, some people with cystic fibrosis die at 15, and some live to 50,” Wright says. “It’s that variation that we’re trying to understand. What is it in the constitution of their DNA that allows some people to survive so long?”
To answer that question, Wright and collaborators at the UNC-Chapel Hill Cystic Fibrosis Center are conducting complex genetic profiling on thousands of cystic fibrosis sufferers — a data-crunching challenge that scientists have only recently been able to address thanks to fast, powerful computers. By comparing the genetic profiles of different people, the researchers are learning how the disease progresses.
“If we find variations that correlate to reduced lung function, then it becomes a matter of working with medical geneticists to understand how the genes may be interacting or mediating the immune system to cause the lung to become inflamed,” Wright explains. “The eventual hope is that there might be a drug target that could help fix the problem.”
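For readers curious what "finding variations that correlate" can look like in practice, here is a minimal sketch in Python. It is not the method Wright's team uses; it only illustrates the general idea of testing many variants against one measurement and keeping the few that clear a strict cutoff. Everything in it is an assumption made for illustration: the data are simulated, the function names (`pearson`, `approx_p_value`, `scan`, `simulate`) are hypothetical, the p-value is a rough normal approximation, and the cutoff is a plain Bonferroni correction.

```python
import math
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no signal
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (normal approximation)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def scan(genotypes, phenotype, alpha=0.05):
    """Test every variant; keep those that pass a Bonferroni-corrected cutoff."""
    threshold = alpha / len(genotypes)  # stricter as the number of tests grows
    hits = []
    for index, variant in enumerate(genotypes):
        r = pearson(variant, phenotype)
        p = approx_p_value(r, len(phenotype))
        if p < threshold:
            hits.append((index, r, p))
    return hits


def simulate(n_people=500, n_variants=200, causal=7, effect=-8.0, seed=1):
    """Simulated data (an assumption): one variant lowers a lung-function score."""
    rng = random.Random(seed)
    # Each genotype is 0, 1 or 2 copies of a variant carried with frequency 0.3.
    genotypes = [
        [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_people)]
        for _ in range(n_variants)
    ]
    phenotype = [
        80.0 + effect * genotypes[causal][person] + rng.gauss(0.0, 10.0)
        for person in range(n_people)
    ]
    return genotypes, phenotype


genotypes, phenotype = simulate()
hits = scan(genotypes, phenotype)
for index, r, p in hits:
    print(f"variant {index}: r = {r:+.2f}, p = {p:.2e}")
```

The cutoff shrinks as the number of tests grows, which is one reason a scan across many variants needs both large patient groups and fast computers: with more haystacks to search, each needle has to stand out more clearly before it counts as a find.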
Tracing Toxins
From the pesticides that protect crops to the pressurized fluid injected into shale formations to extract natural gas and petroleum, toxic chemicals pose a growing risk to people and the environment. Finding the genetic connection between toxins and diseases is crucially important — and enormously difficult.
“Why do people respond differently to the same environmental toxins?” asks David Reif, a statistician and geneticist who joined NC State in 2013 after seven years at the U.S. Environmental Protection Agency. “If two people drink the same tap water, why does one person get sick while the other does not?”
The answer may lie in the genetic variations Reif studies. But even with all the computing power of a major research university at his fingertips to crunch vast amounts of data and churn out volumes of reports, Reif notes that computers don’t perform the most important function in science: thinking.
“The computer doesn’t solve the problem without instructions on where and how to look,” Reif says. “But it’s great at performing a simple task umpteen billion times without getting bored.”
Once computers have done their job of highlighting promising associations, Reif begins the challenging work of interpreting the data. The genetic pathways that lead from toxic exposure to physical illness are rarely marked with clear signposts. But researchers are just starting down the road when it comes to big data.
If it seems strange that statisticians are leading medical and environmental health research projects, it’s time to update your thinking. In the age of big data, health solutions are as likely to come from analytics as from traditional clinical trials.
“There’s been a change in the last decade,” says Bird, head of the bioinformatics cluster. “Statisticians are no longer just service people that you go to for help with your experiment. They’re now leading the discipline.”
A version of this story appeared in the fall 2014 issue of Results, the biannual magazine of research, innovation and economic development at NC State.