“Data, data, data”
For computational biologist Bonnie Berger, SM ’86, PhD ’90, the explosion of new genomic information offers a gold mine of opportunities.
Professor Bonnie Berger has always had a knack for learning languages. Growing up in Miami, she could understand and speak Spanish and also studied Hebrew, becoming fluent after spending just over three months in Israel. She can read and write—and, to some extent, converse—in Russian after only two years of studying it in college. And “I always spoke math and computer science,” says Berger, SM ’86, PhD ’90, who is the Simons Professor of Mathematics and leads the Computation and Biology group at the Computer Science and AI Lab (CSAIL). Today, she’s fluent in genomic data on top of it all. “I’m good at languages, so the language of biology was something I could pick up as well,” she says.
Being a polyglot pays dividends. With the advent of DNA sequencing and other techniques, she has been applying the tools of computer science and mathematics to tease meaningful insights out of a deluge of biological information. As she describes it, the theme of her work is “data, data, data.”
Berger has adapted the PageRank algorithm used by search engines to predict inherited similarities across species. She has used code-breaking strategies to predict protein structures and applied computational techniques to drug discovery. She has used language models to assess how readily SARS-CoV-2 variants will evade the immune system and employed topology to predict virus assembly and misassembly.
Berger’s research spans at least 10 different subfields, including comparative genomics, algorithms, bioinformatics, cryptography, population genetics, protein and RNA structure, drug target interactions, cancer research, virology, and one or two of her own invention. Her publication record in each one would be sufficient for an impressive career. Making an impact in so many diverse areas is extraordinary. But the way Berger sees it, she has simply gravitated to where the interesting problems are. She thinks of the result as “hybrid disciplines coming together to inform each other.”
“My work is a mash-up,” she says. “Everything informs everything else to make it better.”
Bonnie Berger’s advice to young scientists
On time management
“Allot four hours a day where you’re going to work with no interruptions—don’t look at your phone, don’t look at your email or go on the internet. Sit in a coffee shop and hyperfocus.”
Be open to many areas. “You never really know where things are going to go. They may come together to gel into better science.”
Don’t talk about your latest, best ideas until your work is done.
On public speaking
Berger divides her talks into modules, which she mixes and matches and wraps in a narrative depending on the audience. She adds a new module at each new talk. Her two rules of thumb: Don’t start with your newest module, and memorize the beginning of your talk to overcome nervousness. “Know those first few slides cold, and then you’re off and running,” she says.
To young scientists, especially women
“Early in your career, just get your work done.” Don’t play politics, don’t piss people off, and don’t let anybody put you on crazy committees. “Don’t be very vocal and be a problem until you’re tenured,” she says. “Don’t try to please the world. Don’t try to do anything but get fantastic research done. That’s all that matters.”
On saying no
Early on, don’t say yes to things, and never right away. Say “I’ll get back to you” and think about how it fits in. But if you say no, do it quickly.
“You just have to,” she says. “Find a good undergrad. And that person benefits too; they learn so much.”
On starting a lab
“Lab 101: Take nice people. Take people that will be supportive of each other.” One of the strengths of her lab, she says, is that she hires such people across the entire spectrum of expertise—when you have a question, there’s always someone who will help.
To MIT students
The humanities at MIT are a “hidden gem.”
Take time off
Allow yourself two weeks in the winter and two in the summer, and take a scheduled day off every weekend. Turn off your phone—“Throw it in the sea!”—and close your computer. “You will last a lot longer and you will feel a lot more invigorated to work,” she says. “It doesn’t help to work every day.”
An ongoing collaboration with two other MIT professors on a Biden administration moonshot cancer grant is a case in point. Timothy Lu ’03, MEng ’03, PhD ’08, a professor of biological engineering and EECS and a researcher at the Broad Institute, kicks things off by measuring the impact of making genetic changes to single tumor cells. Then Berger analyzes the resulting data, predicts how changes in gene expression affect cancer development, and passes those predictions along to biology professor and Koch Institute researcher Ömer Yilmaz, who validates them in studies on mice and human organoids. “We go round and round,” Berger says. “We all inform each other in a cycle.”
Berger says she is driven by curiosity: “I only work on what gets me excited.” Inspired by connections between different areas, she sees their converging paths and draws from each of their toolboxes. “It’s not so much about making a career as it is about focusing on the work,” she says. “And the work is more meaningful and on target because it’s influenced by different perspectives.”
“All Bergers do math”
Berger, whose father was a businessman and classically trained pianist and whose mother was head of the Jewish Education Service of North America, entered kindergarten in the mid-1960s, a time when girls were often discouraged from pursuing mathematics. But her father was eager to pass on his dream of becoming a mathematician and began slipping math problems under her door when she was in grade school. He’d start with a note: “Good morning, Bonnie! Would you like to do math problems today?” She would check yes or no, and if the answer was yes, she’d tackle the day’s problem set: sequences and series, for example, with some of the numbers provided and some left for her to fill in. “If I wanted math problems, of course I should have math problems!” she says.
“She’s a Berger,” her father would say. “All Bergers do math!”
In her first two years of high school, Berger worked through math packets and tests at her own pace, making it all the way through calculus. Sitting in the back of the classroom with juniors and seniors, she helped them with their work and “learned about life” in exchange. But in her junior year she transferred to a prep school, where she was required to repeat her math classes and was one of only two girls taking calculus—until the other girl left. When she saw a male classmate using the school’s punch card computer, she was excited to try it herself but wasn’t allowed.
As an undergraduate at Brandeis, Berger had to take calculus a third time—and chose to take it pass/fail. After getting perfect scores all semester without studying, she remembers running to the professor in a panic after missing the final exam. Though she had already passed the class, she insisted on taking the final, imploring, “I need to prove to myself that I can do this!” She walked out with another perfect score, less than an hour into the three-hour test.
Berger declared majors first in Russian and then in psychology, but she discovered a love of coding in a sophomore class on statistical programming in Fortran and switched to the school’s new computer science major. In 1981, her junior year, she spent $800 of her own money on a 1,200-baud modem so that she could code through the night from her shared apartment, feeding her code into the mainframe. She credits her professors at Brandeis with encouraging her to consider graduate school. “Even though I got discouraged out of math,” Berger says, “I came back to it.”
At MIT, Berger studied computer science under the cryptography pioneer and future Turing Award winner Silvio Micali and informal mentor Peter Shor, a professor of applied mathematics and winner of this year’s Killian Award. After winning an award for her doctoral thesis, she began her postdoc at the Institute; one year in, she made the switch to biology—or, rather, began mining the field for interesting problems to work on. Daniel Kleitman, a professor of applied mathematics and Berger’s postdoc advisor, had heard Stanford biophysicist (and future Nobel laureate) Michael Levitt speak about protein folding. And just like the businessman in The Graduate who urged Dustin Hoffman’s character to pursue plastics, Kleitman “was so enthralled that he came back and said to me: ‘Proteins!’” Berger recalls. “‘That’s what you should do.’” She smiled at his movie reference, and decided she was game.
Proteins need to fold into three-dimensional shapes in order to become biologically active, and the exact shape matters, since the same protein folded in different ways can do different things. How a protein is folded affects such things as which binding sites are exposed and how it interacts with other molecules. The big debate at the time was whether proteins had intermediate states while folding.

Berger worked on predicting the presence and functions of so-called coiled coils, strings of amino acids within a protein that are twisted together like two telephone cords; they play roles in gene expression and in stabilizing links between proteins. Having been introduced to cryptography in Shafi Goldwasser’s class on the subject and by her peers in Micali’s lab, Berger wondered whether a technique that uses the frequency of character pairs and triplets to break codes could be adapted to analyze protein sequences. Before getting a lab tech to help her, Berger spent hours upon hours comparing protein sequences by hand to look for patterns. She figured out how to predict, on the basis of those sequences, which proteins would form coiled-coil structures.

The resulting paper and its follow-ups have now been cited 2,000 times, and this work has allowed researchers to do things like predict, as biology professor Peter Kim has done, how an influenza virus binds to a cell membrane through a spring-loaded mechanism. Software programs Berger’s group has since developed have been used by thousands to identify coiled coils and their functions, she says.
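The borrowed code-breaking idea can be illustrated with a toy frequency analysis. Everything below is invented for illustration, including the training sequences and the scoring scheme; Berger’s actual coiled-coil predictor relied on curated databases and position-specific statistics, not this simple log-odds sketch.

```python
from collections import defaultdict
import math

# Toy illustration of cryptography-style pair-frequency analysis applied
# to protein sequences. Training sequences and scoring are invented;
# the real predictor used curated coiled-coil data and heptad-position-
# specific statistics.

def pair_counts(seqs):
    """Count how often each ordered pair of adjacent residues occurs."""
    counts = defaultdict(int)
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    return counts

def log_odds_score(window, fg_counts, bg_counts):
    """Score a window by summed log-odds of its residue pairs under
    'coiled coil' vs. background counts (with add-one smoothing)."""
    fg_total = sum(fg_counts.values()) or 1
    bg_total = sum(bg_counts.values()) or 1
    score = 0.0
    for a, b in zip(window, window[1:]):
        p_fg = (fg_counts[(a, b)] + 1) / (fg_total + 1)
        p_bg = (bg_counts[(a, b)] + 1) / (bg_total + 1)
        score += math.log(p_fg / p_bg)
    return score

# Invented "known coiled coil" vs. "background" training sequences.
coiled = ["LQELKELLQE", "LKELLEQLKE"]
background = ["GPSTNAGPST", "APGSTNAPGS"]
fg, bg = pair_counts(coiled), pair_counts(background)

# A coiled-coil-like window scores high; a background-like one scores low.
print(log_odds_score("LKELLQELKE", fg, bg) > 0)   # prints True
print(log_odds_score("GPSTAPGSTN", fg, bg) < 0)   # prints True
```

The same pair statistics that expose letter habits in an enciphered message expose residue habits in a protein family, which is the heart of the analogy.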
Berger also worked with Jonathan King in the biology department and Shor to study viral capsids, the three-dimensional shells that protect viruses and help them get inside their host cells. She predicted that the formation of these structures, which are composed of repeating protein subunits, must follow local rules: “One protein is like a lock into a key that binds to the next protein, which changes its shape.” Her rules described how viral capsids could self-assemble from a given number of different proteins.
Comparing mice and men
Berger joined the MIT faculty in 1992 as an assistant professor of applied mathematics, with a joint appointment in the Laboratory for Computer Science (LCS). Soon after, she began working on the first comparison of the human and mouse genomes with her students Serafim Batzoglou ’96, MEng ’96, PhD ’00, and Lior Pachter, PhD ’99, along with future Broad Institute researcher Eric Lander. After Lander suggested looking at whole-genome comparisons, Berger, Pachter, and Batzoglou came up with an algorithm that made it possible to align the genomes of two different species. Then they worked with Lander to map the genes between human and mouse, starting by aligning large matching windows and then drilling down to smaller and smaller matches within them. In 2000, they published their results, demonstrating that coding regions of the genome are on average 80% identical between the two species. That paper, says Berger, launched the subfield of comparative genomics.
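The windows-then-drill-down strategy can be sketched with a toy anchor finder. The sequences and window sizes below are invented; the actual human-mouse work handled mammalian genomes with far more sophisticated matching.

```python
# Toy sketch of "align big windows, then drill down": find long exact
# k-mer anchors shared by two sequences, then rerun the search with a
# smaller k inside the gap between anchors. Sequences are invented.

def kmer_index(seq, k):
    """Map each k-mer to the list of positions where it occurs."""
    idx = {}
    for i in range(len(seq) - k + 1):
        idx.setdefault(seq[i:i + k], []).append(i)
    return idx

def unique_anchors(s1, s2, k):
    """Exact k-mer matches that occur exactly once in each sequence."""
    i1, i2 = kmer_index(s1, k), kmer_index(s2, k)
    return sorted((p[0], i2[m][0]) for m, p in i1.items()
                  if len(p) == 1 and len(i2.get(m, ())) == 1)

# Two toy "genomes": shared blocks in upper case, unrelated filler between.
s1 = "GATTACAGGA" + "xx" + "TAGC" + "xx" + "CCGGTTAACC"
s2 = "yy" + "GATTACAGGA" + "zz" + "TAGC" + "zz" + "CCGGTTAACC"

# Coarse pass: long windows (k=10) that match exactly and uniquely.
coarse = unique_anchors(s1, s2, 10)
print(coarse)  # prints [(0, 2), (18, 20)]

# Fine pass: drill into the gap between the two coarse anchors with k=4.
(a1, a2), (b1, b2) = coarse
gap1, gap2 = s1[a1 + 10:b1], s2[a2 + 10:b2]
fine = [(i + a1 + 10, j + a2 + 10) for i, j in unique_anchors(gap1, gap2, 4)]
print(fine)  # prints [(12, 14)]
```

Restricting the fine pass to the region between coarse anchors is what keeps the hierarchical search tractable on genome-length inputs.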
Berger and Batzoglou also helped set the stage for sequencing the human genome. Using the shotgun sequencing protocol, researchers had split up the genome into millions of random DNA fragments, which they then sequenced individually. Then they needed software tools that could look for overlaps in the sequences so they could reassemble them in the correct order. Berger and Batzoglou developed the prototype for one such tool, called the Arachne sequence assembler. After earning his PhD, Batzoglou worked for a year at the Whitehead Institute, where he and a colleague turned that prototype into a production-quality tool Lander’s lab used extensively as part of the large collaborative effort to assemble the first human genome. That work would help earn Batzoglou a spot on Technology Review’s list of Top 100 Young Technology Innovators in 2003, following in the footsteps of Berger, who made the inaugural TR100 list in 1999.
Berger’s group extended this human-mouse comparison work to look at fruit flies and 18 species of yeast, the focus of Manolis Kellis ’99, MEng ’99, PhD ’03, who was then a student and is now a fellow MIT professor. That research led Berger and Rohit Singh, PhD ’12, to develop software called IsoRank, which aligns genome sequences from different species using a ranking algorithm similar to the PageRank system in search engines: two regions are likely to be a good match if their neighbors are alike, and their neighbors’ neighbors, and so on. This makes it feasible to integrate disparate types of data—for example, sequence alignments, protein-protein interactions, or genetic interactions—to find genes with common ancestry and function in different species. “Our idea,” says Berger, “was that taking a network view allows us to analyze different kinds of data through a common lens.”
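A simplified, hypothetical version of that neighbor-matching idea can be written as iterative similarity propagation over two small networks. The actual IsoRank also blends in sequence similarity and runs on genome-scale interaction networks; this sketch captures only the network part.

```python
# Minimal sketch of neighbor-based similarity propagation between two
# networks, in the spirit of IsoRank's PageRank-style update. Graphs are
# invented; the real algorithm also mixes in sequence similarity.

def propagate_similarity(g1, g2, iters=20):
    """g1, g2: adjacency dicts {node: [neighbors]}. Returns pair scores."""
    nodes1, nodes2 = list(g1), list(g2)
    n = len(nodes1) * len(nodes2)
    s = {(a, b): 1.0 / n for a in nodes1 for b in nodes2}
    for _ in range(iters):
        new = {}
        for a in nodes1:
            for b in nodes2:
                # A pair scores well if its neighbors pair up well,
                # discounted by how connected those neighbors are.
                new[(a, b)] = sum(
                    s[(u, v)] / (len(g1[u]) * len(g2[v]))
                    for u in g1[a] for v in g2[b]
                )
        total = sum(new.values()) or 1.0
        s = {k: v / total for k, v in new.items()}  # keep scores normalized
    return s

# Two small "species" networks, each a triangle with a one-node tail.
g1 = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
g2 = {"w": ["x", "y"], "x": ["w", "y"], "y": ["w", "x", "z"], "z": ["y"]}
sim = propagate_similarity(g1, g2)

# The hub "c" pairs far more strongly with the other hub "y" than with
# the low-degree tail node "z".
print(sim[("c", "y")] > sim[("c", "z")])  # prints True
```

On its own, pure structural propagation rewards matching high-degree nodes with each other, which is exactly why blending in sequence similarity matters in practice.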
IsoRank led to her more recent Mashup software, which pulls insights about the roles various genes play in both healthy and diseased organisms by drawing on even more heterogeneous sources of information: transcriptomic data (the messenger RNA molecules generated from the genome), proteomic data (the proteins), gene expression data, pharmacogenomic data, experimental data, genomic data from different species, “whatever is out there.” This work allows Berger to predict functions: determining whether two genome regions are involved in signaling or cancer pathways, for example, or whether or not they interact.
Berger explains that condensing a mosaic of complex data (what she calls “high-dimensional data”) into a simpler representation that still captures the important variation allows you to make that data interpretable and, therefore, useful. Recently, she and collaborators applied this approach to research on Parkinson’s disease. Working with yeast that had been engineered to express proteins associated with the disease, they mapped known Parkinson’s-related genes in the yeast model to the corresponding human genes. Now, she’s working on predicting protein-protein interactions in under-examined species that have little obvious similarity to fruit flies, mice, and other better-studied organisms.
Compressing genomes and protecting privacy
Together with two of her math students, Po-Ru Loh, PhD ’13, and Michael Baym, PhD ’09 (both now professors at Harvard Medical School), Berger invented methods to compress DNA sequence data, drug molecule data, metagenomic data, or protein sequence data in such a way that it could be fed into analysis software without being decompressed—a technique she calls compressive genomics. The key is that the evolutionary tree of life is not too bushy. Genomic sequences cluster in very dense groups of near-identical sequences, and groups that have little in common tend to be far from each other—so each group can at least initially be summarized by a representative. “This reduced representation allows you to compute directly on the compressed data, and then hop around a little and gather everything else up,” Berger says. Once the data has been compressed, it can be run through any existing tool without modifying the tool itself. Berger’s team was able to accelerate decades-old sequence-comparison algorithms by two orders of magnitude while retaining more than 99% accuracy.
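In miniature, the idea looks like this. The sequences, distance measure, and thresholds below are invented; real compressive genomics works on massive databases and wraps standard tools such as BLAST.

```python
# Toy sketch of compressive genomics: cluster near-identical sequences
# under a representative, search the representatives first, and only
# "decompress" (search inside) clusters whose representative is close.
# Sequences and thresholds are invented for illustration.

def hamming(a, b):
    """Mismatches in the overlap plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def compress(seqs, radius=2):
    """Greedy clustering: each sequence joins the first representative
    within `radius` mismatches, or becomes a new representative."""
    clusters = []  # list of (representative, members)
    for s in seqs:
        for rep, members in clusters:
            if hamming(s, rep) <= radius:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters

def search(query, clusters, radius=2, coarse_slack=2):
    """Two-stage search: compare against representatives, then expand
    only clusters whose representative is within radius + slack."""
    hits = []
    for rep, members in clusters:
        if hamming(query, rep) <= radius + coarse_slack:
            hits += [m for m in members if hamming(query, m) <= radius]
    return hits

db = ["ACGTACGT", "ACGTACGA", "ACGTACCT",   # one dense cluster
      "TTTTGGGG", "TTTTGGGA"]               # another
clusters = compress(db)
print(len(clusters))                  # prints 2
print(search("ACGTACGT", clusters))   # prints ['ACGTACGT', 'ACGTACGA', 'ACGTACCT']
```

For equal-length sequences, setting `coarse_slack` equal to the cluster radius guarantees by the triangle inequality that the coarse pass never discards a true hit, which is why computing on the compressed representation loses no accuracy here.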
Berger also developed a framework for sharing sensitive biological data such as human genomes across different institutions while retaining privacy, using a technique called multi-party computation that was well known in cryptography but had not yet been applied to biology. Biology also posed new scalability problems that had to be solved along the way.
Working with her EECS students Brian Hie, SM ’19, PhD ’21, and Hyunghoon Cho, PhD ’16 (now fellows at Stanford and the Broad Institute, respectively), Berger used this framework to allow pharmaceutical companies to securely pool their data on interactions between drugs and drug targets without revealing it to each other. The shared but encrypted data from all participating companies trains a model; individual companies can then query the model to learn whether or not a drug targets a protein of interest. This system can help the group of companies repurpose drugs more efficiently than any one company could do on its own.
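One classic multi-party-computation building block, additive secret sharing, shows in miniature how parties can compute on pooled data that no one ever sees in the clear. The values and setup below are invented, and Berger’s actual framework is far more elaborate, supporting encrypted model training at scale.

```python
import random

# Minimal sketch of additive secret sharing over a prime field: each
# company splits its private value into random shares, the shares are
# summed per server, and only the pooled total is ever reconstructed.
# Values and party counts are invented for illustration.

P = 2**61 - 1  # a large prime modulus

def share(secret, n_parties):
    """Split `secret` into n random shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Three companies each hold a private count of validated drug-target hits.
private_counts = [17, 42, 8]
all_shares = [share(c, 3) for c in private_counts]

# Each "server" receives one share from every company and sums them.
server_sums = [sum(col) % P for col in zip(*all_shares)]

# Combining the per-server sums reveals only the pooled total, never any
# individual company's count.
print(reconstruct(server_sums))  # prints 67
```

Any single server's view is a set of uniformly random field elements, so no individual input leaks unless all servers collude.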
Validating the drug-target interactions the team produced for the paper, for example, cost $4,000; Berger estimates that testing all such interactions without the model’s help in narrowing down candidate pairs would have cost hundreds of thousands of dollars. “The data’s growing enormously. It’s been very difficult to handle and it will keep growing, but it has huge opportunities,” she says.
Berger problem sets
With her students and collaborators, Bonnie Berger has worked on a wide range of problems in disparate fields. Here are just a few.
- Integrating single-cell data across studies, addressing batch effects, noise, and different formats
- Identifying rare cell types in single-cell data, and the genes that are important in each rare cell type
- Predicting cancer-driving mutations in noncoding regions of the genome from genomic and epigenomic data
- Developing a new, faster programming language that operates on compressed biological data
- Identifying repression by microRNAs of repeat-rich coding regions of the genome
- Using machine learning to help interpret data sets from cryogenic electron microscopy
- Predicting secondary structures of RNA and proteins
Mad Libs for viruses
Recently, Berger invented a tool for predicting how effective various strains of influenza, HIV, SARS-CoV-2, and other viruses will be at evading the immune system. Her method, which she has dubbed “Mad Libs for viruses,” repurposes language models, which usually predict the probability that particular sequences of words will appear in a sentence. Berger’s language models are trained on existing protein sequences; unlike other methods of inferring viral functionality, they don’t require multiple sequence alignments between new strains and known ones.
Running the model can tell you how a new variant or virus fits into what the model learned from previous viruses. The last layer of the model describes the syntax: if the protein is not going to fold, won’t bind to cell membrane proteins, or is not able to infect the cell, for example, it is “not grammatically correct.” The second-to-last layer tells you the semantics—as Berger puts it, “Is this so far different from the original viral strain that it will escape antibody recognition?” Together, the syntax and semantics tell you whether a new variant or virus has the potential to be especially dangerous.
Whereas in Mad Libs blank spaces in a sentence are filled by nouns, verbs, adjectives, or adverbs, Berger’s software swaps out subsets of the virus’s amino acids. When parts that get slotted into the model prove to be grammatically incorrect, it suggests that they pose little danger. But those that are grammatically correct yet semantically very different from the original have the potential to be problematic. “To have a really funny Mad Lib,” Berger says, “you need enough change in meaning.” (In terms of viruses, of course, a funny Mad Lib is anything but funny—it’s likely to escape an immune system trained on previous strains.)
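As a toy stand-in for those deep models, a simple bigram “language model” over amino-acid letters can illustrate the syntax/semantics split. Everything below, sequences included, is invented; the real work uses neural protein language models, not bigram counts.

```python
import math
from collections import defaultdict

# Toy stand-in for the "Mad Libs" idea: "syntax" = how probable a mutated
# sequence is under a model trained on past strains; "semantics" = how far
# its composition drifts from the original. All data here is invented.

def train_bigram(seqs):
    counts, totals = defaultdict(int), defaultdict(int)
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return counts, totals

def log_prob(seq, model, alphabet_size=20):
    """Smoothed bigram log-probability: higher = more 'grammatical'."""
    counts, totals = model
    lp = 0.0
    for a, b in zip(seq, seq[1:]):
        lp += math.log((counts[(a, b)] + 1) / (totals[a] + alphabet_size))
    return lp

def semantic_shift(seq, ref):
    """Crude 'embedding' distance: difference in letter composition."""
    letters = set(seq) | set(ref)
    return sum(abs(seq.count(c) - ref.count(c)) for c in letters)

past_strains = ["MKVLLAGG", "MKVLLAGA", "MKVLLSGG"]
model = train_bigram(past_strains)
wild_type = "MKVLLAGG"

# A conservative mutation stays "grammatical"; a scrambled variant does
# not, suggesting it would fail to fold or infect.
print(log_prob("MKVLLSGA", model) > log_prob("GGALLVKM", model))  # prints True
```

The dangerous region in this picture is high `log_prob` combined with large `semantic_shift`: a variant the model still considers well formed but that has drifted far from what existing antibodies recognize.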
Berger used her viral language models to help the Coalition for Epidemic Preparedness Innovations (CEPI) determine that the “deltacron” covid variant, a SARS-CoV-2 virus derived from parts of both delta and omicron, has immune escape “semantics” almost identical to those of the highly transmissible omicron. In another project, she used the language models to help the Centers for Disease Control and Prevention predict potential future variants with capacity for immune escape. She is also working with CEPI to predict future variants in the interest of developing a comprehensive covid-19 vaccine. And she has used the language models to predict a universal antibody against SARS-CoV-2 variants, which has since been verified in the lab.
A multilinguist in the world
Berger says that often her students lead her in new directions. The possible paths are many: her lab includes applied mathematicians, pure mathematicians, algorithmicists, MD-PhDs, biologists, computational biologists, and computer scientists.
In matching her students with problems to work on, Berger focuses on finding out what they enjoy: “I listen to them,” she says. “I see what they’re good at, who they are. I will never let them work on something they’re not excited about.”
Ellen Zhong, PhD ’22, for example, pursued research in cryogenic electron microscopy, which was new for the lab. In the spring, Zhong was fielding offers for faculty positions she says she would not have applied for without Berger’s encouragement: “It was all thanks to Bonnie.” As she starts her own lab, Zhong hopes to replicate the way Berger trusts and supports her students and finds insight into their interests and career goals. Berger, she says, “believes in her students.”
Ibrahim Numanagić, a former postdoc in Berger’s lab who is now a professor at the University of Victoria, says: “Whatever positive you can think of her, that’s it and even more.” Kellis, who now teaches alongside her at MIT, describes Berger as a wonderful mentor: “She really cares deeply about our success and career, for decades after graduation,” he says, noting that she “always provides important advice, resources, connections, and the right balance of help and independence to let her students develop into independent and collaborative researchers.”
Of her many accolades, Berger is most proud of being named to the National Academy of Sciences and the American Academy of Arts and Sciences, as well as receiving the senior scientist award from the International Society for Computational Biology (ISCB). “It might seem like I have all these awards now,” she says, “but they didn’t happen until more recently.” For much of her career, women simply were not recognized, and Berger was almost always one of the few women in the room, if not the only one. Early on, “Planet Math” (as she calls the math field) was hostile to women; at MIT, she found shelter at LCS and CSAIL, where she met Goldwasser, Nancy Lynch, Barbara Liskov, and a community of “amazing women I could learn from and be mentored by.”
She also got critical encouragement from Florence Ladd, the head of the Radcliffe Bunting Institute, when she was a fellow there in 1992-’93. Just as Berger began doubting that she could maintain her career if she had kids, Ladd urged her to persist: “She said, ‘Look how successful you’ve been. You can’t drop all this. You can make it work.’” So Berger, who is married to Tom Leighton, a professor of applied mathematics and cofounder and CEO of Akamai Technologies, did just that, and weathered what she describes as a midcareer gap while raising her children: “I was picking my kids up at school and wanting to be in the car, just so I heard what the scoop was for the day. And I published less during that time, but it’s all fine.” In 2009, she says, when her youngest was 11, “I just revved up again.”
Berger has since worked to increase representation of women in her fields by mentoring female students within and outside her lab. As ISCB vice president, she ran workshops on gender inequality in 2016 and pushed to include more women (who then accounted for only four out of 44 fellows) and underrepresented minorities among fellows and awardees. “I think it’s gotten much better,” she says, going so far as to call 2022 “the year of the woman.”
In the coming years, it will not be surprising if Berger adds yet more fields, more techniques, and more exploration to her repertoire. Biology is generating unprecedented amounts of data, a potential gold mine for an endlessly curious, multilingual researcher like her. The future, she says, will continue to require flexibility. “The amount of data and the kind of data has absolutely changed, and will keep changing,” she says. “And you have to be willing to move with it.”