Shaping the Future of the NIST Mass Spectral Library
Tytus Mak, Statistician, National Institute of Standards and Technology (NIST)
When one thinks of mass spectral libraries — well, let’s be honest here, the very idea of mass spectral libraries will never cross most people’s minds, even most scientists’. But that’s true for most of the seemingly esoteric things that many of us at the National Institute of Standards and Technology (NIST) have devoted our careers to. And though many of us have often struggled to explain the nature of our work to friends and loved ones (and more often than not failed to elicit excitement), the importance of NIST’s work cannot be overstated. This is very much true for what we call the NIST Standard Reference Database 1A, more commonly known as the NIST Mass Spectral Library.
This year marks the eighth release of the library, which includes over 2 million mass spectra measured for over 350,000 chemical compounds. A mass spectrum is essentially a fingerprint generated by analyzing a chemical with an instrument called a mass spectrometer. One of the quickest and most accurate ways to identify an unknown substance is to acquire its mass spectrum and compare it with the reference spectra in a library. This sort of analysis is routinely performed in thousands of labs across the world in a wide range of industries, many of which rely on the NIST Mass Spectral Library.
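For readers who like to see the idea in code, here is a minimal sketch of what "comparing a spectrum to a library" can look like. It assumes each spectrum is a simple mapping from mass-to-charge value to peak intensity and scores matches with plain cosine similarity. The function names, the two-entry library and all peak values are made up for illustration; this is not the scoring used by NIST's own search software, which is more elaborate.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


def best_match(query, library):
    """Return (score, name) for the library entry most similar to the query."""
    scored = [(cosine_similarity(query, ref), name) for name, ref in library.items()]
    return max(scored)


# Hypothetical toy library: invented peaks, not real reference spectra.
toy_library = {
    "compound_A": {43: 100, 58: 30, 71: 10},
    "compound_B": {77: 100, 105: 60, 51: 20},
}

# Hypothetical "unknown" spectrum that resembles compound_A.
query = {43: 95, 58: 35, 71: 8}

score, name = best_match(query, toy_library)
print(f"Best match: {name} (similarity {score:.3f})")
```

A score near 1 means the two fingerprints line up closely, and a score of 0 means they share no peaks at all. A search is only as good as the reference spectra behind it, which is why the contents of the library matter so much.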
The library has been continuously updated for over 40 years, with three-year release cycles. A common question that’s asked whenever we release a new version of the library is “What’s new?” It is indeed a perfectly reasonable question, but a vexingly difficult one to answer considering we add spectra from upward of 7,000 new chemical compounds each year. The kind of answer that people expect also varies depending on what field they’re in, and what their interests are.
However, an even more common, and typically unasked, question in the back of people’s minds is “Why?” It’s an almost existential question, perhaps even a mildly off-putting one for those of us who have helped build the library over the years, decades, and, for some of us, our entire scientific careers. Still, it’s certainly a question that merits contemplation, especially for those of us who are so close to the library’s development.
Updating the database
Having joined NIST in 2014, I am a relative newcomer to the Mass Spectrometry Data Center (MSDC), the group that is responsible for maintaining, curating and expanding the NIST Mass Spectral Library. Despite my relative inexperience, I have nonetheless been given a significant role in the library’s expansion, which is to select the aforementioned 7,000 or so chemical compounds each year that will be acquired and analyzed, and whose spectra will eventually be added to the library. With tens of millions of chemical compounds available for purchase, how does one go about selecting a mere fraction of a fraction of these for analysis? Why even bother trying to “select” compounds anyway; why can’t we just go down the list by alphabetical order and buy as many as we can handle?
The short answer is that most of the compounds for sale are irrelevant because they are produced for large-scale drug discovery studies by biopharmaceutical companies. Most of these compounds would not be found in nature and are of no interest to anyone else, at least unless the chemical ends up being a useful drug, so there’s little reason to add them to our library. Thus, the task at hand is to select compounds that people do care about.
This is an exceptionally wide umbrella, covering fields such as forensics, the flavor and fragrance industry, wastewater treatment, food science, biomedical research and agriculture, to name a few. Being at the helm of this selection process has been a huge learning experience for me, not only in understanding the intricacies of our library-building process, but also in appreciating the broad impact the NIST Mass Spectral Library has had in such a wide range of critical fields. It will continue to have that impact in rapidly emerging fields like big data-driven biosciences, which focus on the analysis of biological molecules like DNA and proteins.
It is the latter that I am particularly interested in, in no small part because I (and other new members of the MSDC) feel a responsibility to uphold the legacy of the library by keeping it relevant in our rapidly changing world. It is my peculiar educational background and research interests that I rely on to guide me in this process.
Learning the trade
Given that I’m in a group that is so steeped in the world of mass spectrometry (it’s in our name after all!), many are surprised to learn that I am not a trained chemist, much less a trained mass spectrometrist. My official job title at NIST is “statistician,” and though that’s not quite an accurate representation of my skill set, it is nonetheless the closest thing you can get to my area of expertise in the list of federal job titles. My preferred job title would be bioinformatician, someone who designs computational tools and algorithms to analyze biological data.
I got my start in the field when I was just 16 years old. Almost by pure coincidence I happened to land a high-school internship at the Center for Advanced Research in Biotechnology (CARB), which is now known as the Institute for Bioscience and Biotechnology Research (IBBR). It is a research institute in Rockville, Maryland, that many NISTers are quite familiar with, as it is a joint partnership between NIST and the University of Maryland. I certainly didn’t have an inkling that I’d end up working at NIST about a decade down the road, and in the division that is most closely involved with the IBBR, too!
The lab where I chose to spend the summer before my senior year was (unbeknownst to me at the time) quite renowned in the field of computational biology. It was headed by Professor John Moult, and I whimsically chose to work in it because I had never fathomed putting computers and biology together before. Fortunately, it was the right place at the right time for me because the year was 2003, and the Human Genome Project had just wrapped up, revealing over 22,300 protein-coding genes littered across a staggering 3.3 billion base pairs, the molecules that make up the individual steps if you picture DNA as a spiral staircase. Processing this overwhelming amount of information, the work of the field known as genomics, necessitated a whole new class of scientist, one who was knowledgeable in both molecular biology and computer science.
As an intern, my primary task was constructing diagrams of the evolutionary relationships among organisms (phylogenetic trees), and while I could only grasp the very basics of what I was doing, I knew I was hooked and sought to make a career out of it. I earned my bachelor’s degree in electrical engineering and entered graduate school to pursue a Ph.D. in bioinformatics. However, things had changed by 2009 when I was just starting my first lab rotation. There was a new “-omics” field that was emerging called metabolomics, which promised to be as transformative as genomics was, and I just so happened to be in a lab that specialized in it.
Adding metabolomic ‘genes’
While genomics focuses on analyzing the totality of information at the genetic level of an organism, metabolomics focuses on analyzing the totality of information at the metabolic level. Thus, genomics and metabolomics operate on opposite ends of Francis Crick’s central dogma of molecular biology, wherein DNA is transcribed into RNA, which is translated into proteins, many of which catalyze chemical reactions that sustain life, otherwise known as metabolism. My thesis focused on the development of new computational methods for analyzing metabolomics data, which happened to pique the interest of Stephen E. Stein, mass spectrometry guru, NIST Fellow and father of the NIST Mass Spectral Library.
At NIST I continue to pursue my interests in metabolomics, not just in the development of new algorithms, but in enhancing the mass spectral library to make it an invaluable resource for the field. By prioritizing the addition of compounds from databases including the Human Metabolome Database and Chemical Entities of Biological Interest, we add critical puzzle pieces of the human metabolome, analogous to genes, to the library every year. In doing so, I believe the library is akin to the Human Genome Project in its foundational importance to the field of metabolomics.
This post originally appeared on Taking Measure, the official blog of the National Institute of Standards and Technology (NIST), on July 7, 2020.
About the Author
Tytus Mak is a statistician in the Mass Spectrometry Data Center at the National Institute of Standards and Technology (NIST). His research focuses on developing machine learning approaches for analyzing big datasets generated by high-throughput biomolecular analysis platforms including metabolomics, glycomics and proteomics. He received his B.S. in electrical and computer engineering at Cornell University and his Ph.D. in bioinformatics and tumor biology at Georgetown University. He is also an avid martial artist and spends his free time practicing karate and kenjutsu.