Nature Inventory: From Fortran, arXiv to AlexNet, the codes that changed the scientific world

From Fortran compilers to the arXiv preprint library to AlexNet, these computer codes and platforms have transformed the scientific world.

In 2019, the Event Horizon Telescope team took the first picture of a black hole. This image is not a photograph in the traditional sense, but a calculation—a mathematical transformation of data captured by multiple radio telescopes in the United States, Mexico, Chile, Spain and Antarctica. The team made the code publicly available so that the scientific community could see it and build on it for further exploration.

This is becoming an increasingly common pattern. From astronomy to zoology, computers are behind every great modern scientific discovery. Today’s laptops have 10,000 times the memory and clock speed of the lab-built computer he used in 1967, says Stanford University computational biologist Michael Levitt, a winner of the 2013 Nobel Prize in Chemistry. “Today, we have a lot of computing power. But the problem is, it still requires human thinking.”

A powerful computer is useless without software that can handle the research problem and researchers who know how to write and use it. “Research is now closely related to software, and software has permeated all aspects of research,” said Neil Chue Hong, director of the Software Sustainability Institute.

A recent article in Nature set out to identify the key code behind the discoveries that have transformed science over the past few decades, and selected ten software tools that have had a major impact on the scientific community. This article introduces those most closely related to the field of artificial intelligence, including the Fortran compiler, arXiv, IPython Notebook, and AlexNet.

  Language Pioneers: The Fortran Compiler (1957)

The first modern computers were not user-friendly. Programming was literally done by hand, by connecting banks of circuits with wires. Later, machine and assembly languages allowed users to program computers in code, but both still required a deep understanding of computer architecture, which put them out of reach for many scientists.

This situation changed in the 1950s with the development of symbolic languages, especially the “formula translation” language Fortran. The Fortran language was developed by the team of John Backus at IBM. With Fortran, a user can program a computer with human-readable instructions such as x = 3 + 5, which a compiler then translates into fast and efficient machine code.

This CDC 3600 computer, delivered to the National Center for Atmospheric Research in 1963, was programmed with the help of a Fortran compiler. (Image credit: University Corporation for Atmospheric Research/Science Photo Library.)

In the early days, programmers entered code on punch cards, and a complex simulation could require tens of thousands of them. Still, Fortran made programming accessible to researchers who were not computer scientists. “For the first time, we could program by ourselves,” says Princeton University climatologist Syukuro Manabe, who used Fortran with his colleagues to develop one of the first successful climate models.

More than 60 years later, Fortran is still widely used in climate modeling, fluid dynamics, computational chemistry, and other disciplines that involve complex linear algebra and require powerful computers to crunch numbers quickly. Fortran code runs fast, and there are still many programmers who know how to write it. Vintage Fortran codebases remain in active use in laboratories and on supercomputers around the world.

  Signal Processors: Fast Fourier Transform (1965)

As radio astronomers scan the sky, they capture a cacophony of complex signals that vary over time. To understand the nature of these radio waves, they need to see what the signals look like as a function of frequency. A mathematical process called the Fourier transform makes this possible. The problem is that the Fourier transform is inefficient when computed directly: a dataset of size N requires N^2 operations.

In 1965, American mathematicians James Cooley and John Tukey developed a method to speed up the process. Using recursion, a “divide and conquer” programming approach in which an algorithm repeatedly reapplies itself, the fast Fourier transform (FFT) reduces the problem of computing a Fourier transform to just N log_2(N) steps. The speed-up grows with N: for 1,000 points it is about 100-fold; for 1 million points it is about 50,000-fold.
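The divide-and-conquer idea can be illustrated with a short Python sketch. This is a textbook radix-2 recursion written for clarity, not the Cooley and Tukey code or any production library; it assumes the input length is a power of two, and the function names naive_dft, fft and estimated_speedup are invented for this illustration. The direct version evaluates the transform's definition, which takes on the order of N^2 operations.

```python
import cmath
import math


def naive_dft(x):
    """Direct evaluation of the DFT definition: about N * N multiply-adds."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Recursive radix-2 FFT; assumes len(x) is a power of two."""
    n = len(x)
    if n <= 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Combine the two half-size transforms with a "twiddle factor".
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def estimated_speedup(n):
    """Ratio of N * N to N * log2(N) operation counts, ignoring constants."""
    return n / math.log2(n)
```

Calling fft([1, 2, 3, 4, 0, 0, 0, 0]) gives the same values as naive_dft on the same list, up to floating-point rounding. estimated_speedup(1000) and estimated_speedup(10**6) come out at roughly 100 and roughly 50,000, which matches the figures quoted above.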

Oxford University mathematician Nick Trefethen said the discovery of the FFT was actually a “rediscovery,” because the German mathematician Carl Friedrich Gauss had worked it out in 1805 but never published it. Cooley and Tukey did publish, opening up applications of the FFT in fields such as digital signal processing, image analysis, and structural biology. Trefethen considers the FFT “one of the great discoveries in applied mathematics and engineering.” The FFT has been implemented in code many times; one popular implementation is FFTW (“Fastest Fourier Transform in the West”).

Part of the Murchison Widefield Array, a radio telescope that uses fast Fourier transforms for data collection.

Paul Adams, director of the Molecular Biophysics and Integrative Bioimaging Division at Lawrence Berkeley National Laboratory, recalls that when he refined the structure of the bacterial protein GroEL in 1995, the calculation took “many, many hours, even days,” even with the FFT and a supercomputer. Without the FFT, it is hard to imagine how it could have been done at all; the time required would have been prohibitive.

  Standard Interface for Linear Algebra Operations: BLAS (1979)

Scientific computing often involves mathematical operations on vectors and matrices, which are relatively simple but computationally expensive. In the 1970s, there was no generally accepted set of tools for performing such operations. Researchers therefore had to spend time designing efficient code for basic mathematics instead of focusing on the scientific problem itself.

The programming world needed a standard, and in 1979 it got one: the Basic Linear Algebra Subprograms (BLAS). The standard continued to evolve until 1990, eventually defining dozens of basic routines covering vector and matrix operations.

Jack Dongarra, a computer scientist at the University of Tennessee and a member of the BLAS development team, said that BLAS in effect reduced matrix and vector mathematics to a basic unit of computation as fundamental as addition and subtraction.
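To give a sense of what such basic units look like, the sketch below writes out three operations that BLAS standardizes, one from each of its three levels: axpy (vector-vector), gemv (matrix-vector) and gemm (matrix-matrix). It is a plain-Python illustration of the mathematics only. Real BLAS routines are optimized compiled code, carry a type prefix in their names (for example daxpy for double precision), take additional arguments, and update their output in place; the simplified signatures here are assumptions made for readability.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2 (matrix-vector): return alpha * (A times x) + beta * y."""
    return [
        alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
        for row, yi in zip(a, y)
    ]


def gemm(alpha, a, b, beta, c):
    """Level 3 (matrix-matrix): return alpha * (A times B) + beta * C."""
    cols_b = list(zip(*b))  # columns of B, so each entry is a row-column dot product
    return [
        [
            alpha * sum(aik * bkj for aik, bkj in zip(row_a, col_b)) + beta * cij
            for col_b, cij in zip(cols_b, row_c)
        ]
        for row_a, row_c in zip(a, c)
    ]
```

For example, axpy(2, [1, 2, 3], [10, 20, 30]) returns [12, 24, 36], and gemm(1, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 0, [[0, 0], [0, 0]]) returns [[19, 22], [43, 50]]. Higher-level numerical software can be written as sequences of calls like these, which is why a fast implementation of the small set of routines benefits everything built on top of it.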

Cray-1 supercomputer. (Image credit: Science History Images/Alamy)

“BLAS is probably the most important interface defined for scientific computing,” says Robert van de Geijn, a computer scientist at the University of Texas at Austin. Besides providing standard names for commonly used functions, BLAS lets researchers be sure that BLAS-based code will work in the same way on any computer. The standard also enables computer manufacturers to optimize BLAS implementations for fast execution on their hardware.

For more than 40 years, BLAS has formed the heart of the scientific computing stack, allowing scientific software to keep evolving. Lorena Barba, a mechanical and aerospace engineer at George Washington University, calls BLAS “the machinery inside five layers of code.”

  Preprint Platform: arXiv.org (1991)

In the late 1980s, researchers in high-energy physics routinely mailed copies of the manuscripts they had submitted to colleagues, for comment and as a courtesy, but only to a select few. “Those at the bottom of the food chain rely on handouts from those at the top, which tends to completely exclude aspiring researchers from non-elite institutions from the privileged circle,” physicist Paul Ginsparg wrote in a 2011 article.

In 1991, Ginsparg, then at Los Alamos National Laboratory, wrote an email autoresponder to level the playing field. Subscribers received a daily list of preprints, each with an identifier. With a single email, users around the world could submit a paper to the laboratory’s computer system or retrieve one from it.

Ginsparg had originally planned to keep articles for three months and to limit the scope to the high-energy physics community, but his colleagues persuaded him to remove those restrictions. “It was at that moment that it went from bulletin board to archive,” Ginsparg said. After that, papers flooded in, far exceeding Ginsparg’s expectations. In 1993, Ginsparg moved the system to the World Wide Web. In 1998, he officially named it arXiv.org.

Today, the 30-year-old arXiv holds 1.8 million preprints, all free to read, and receives more than 15,000 submissions and 30 million downloads a month. “It’s not hard to see why arXiv is so popular,” the editors of Nature Photonics once wrote. “The system provides researchers with a fast and convenient way to show everyone what they did and when they did it, avoiding the hassle and time required for peer review at a traditional journal.”

The site’s success has also spurred the creation of similar repositories in other disciplines such as biology, medicine, and sociology. Their impact can be seen in the thousands of preprints released on coronavirus-related research.

  Data Explorer: IPython Notebook (2011)

In 2001, Fernando Pérez was a graduate student “in search of procrastination” when he decided to take on a core component of Python.

Python is an interpreted language, meaning that programs are executed line by line. Programmers can use a computational call-and-response tool known as a read-evaluate-print loop (REPL), in which they type code that the interpreter then executes. A REPL allows for rapid exploration and iteration, but Pérez points out that Python’s was not built for science. For example, it did not let users easily preload code modules or keep data visualizations open. So Pérez created his own version.
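The read-evaluate-print cycle itself is simple enough to sketch in a few lines of Python. The function below is an illustration only and is not how Python’s interpreter or IPython is implemented; the name run_repl is invented here, and the loop reads from a list of strings rather than the keyboard so that it can be run non-interactively. Values stay in a shared namespace between lines, which is what makes step-by-step exploration possible.

```python
def run_repl(lines, namespace=None):
    """Run each line through a read-evaluate-print cycle and collect the output."""
    namespace = {} if namespace is None else namespace
    printed = []
    for line in lines:  # read
        try:
            try:
                # Try the line as an expression whose value should be shown.
                code = compile(line, "<repl>", "eval")
            except SyntaxError:
                # Not an expression: run it as a statement (e.g. an assignment).
                exec(compile(line, "<repl>", "exec"), namespace)
                continue
            result = eval(code, namespace)  # evaluate
            if result is not None:
                printed.append(repr(result))  # print
        except Exception as exc:
            # Report the error and keep the loop alive, as an interactive shell does.
            printed.append(f"{type(exc).__name__}: {exc}")
    return printed
```

For example, run_repl(["x = 3 + 5", "x * 2"]) returns ["16"]: the first line is a statement that stores a value, and the second is an expression whose result is echoed back.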

In December 2001, Pérez released IPython, the interactive Python interpreter, with 259 lines of code. A decade later, Pérez teamed up with physicist Brian Granger and mathematician Evan Patterson to migrate the tool to a web browser, creating the IPython Notebook, which revolutionized data science.

Like other computational notebooks, IPython Notebook combines code, results, graphics, and text into a single document. But unlike other projects of this type, IPython Notebook is open source, welcomes contributions from developers in the community, and supports Python, a language commonly used by scientists. In 2014, IPython evolved into Project Jupyter, supporting around 100 languages and allowing users to explore data on remote supercomputers as easily as they do on their own computer.

Nature noted in 2018 that “Jupyter has become a de facto standard for data scientists.” At that time, there were already 2.5 million Jupyter notebooks on GitHub; now there are nearly 10 million, including ones that document the 2016 discovery of gravitational waves and the 2019 imaging of a black hole. “That we made a small contribution to these projects is very rewarding,” says Pérez.

  Fast Learner: AlexNet (2012)

Artificial intelligence (AI) comes in two flavors: one that uses codified rules, and one that lets computers “learn” by emulating the neural structure of the brain. For decades, AI researchers dismissed the second approach as “nonsense,” says Geoffrey Hinton, a computer scientist at the University of Toronto and a Turing Award winner. In 2012, Hinton’s graduate students Alex Krizhevsky and Ilya Sutskever proved otherwise.

In that year’s ImageNet competition, researchers were asked to train an AI on a database of 1 million images of everyday objects, and then to test the resulting algorithm on a separate set of images. “At the time, the best algorithms would misclassify about a quarter of the images,” says Hinton. AlexNet, a neural-network-based deep learning algorithm developed by Krizhevsky and Sutskever, reduced the error rate to 16%. “We almost cut the error rate in half,” Hinton said.

Hinton believes that the team’s success in 2012 reflected a combination of sufficiently large training data sets, excellent programming, and the new power of graphics processing units, originally designed to improve computer video performance. “Suddenly, we were able to make the algorithm 30 times faster, or learn 30 times more data,” he said.

Hinton said the real algorithmic breakthrough had actually come three years earlier, when his lab created a neural network that recognized speech more accurately than conventional AI that had been refined over decades. The gain in accuracy was only slight, but it was a result worth remembering.

The success of AlexNet and related research brought about the rise of deep learning in the lab, the clinic, and many other settings. It is why phones can understand voice queries and why image analysis tools can easily pick out cells in photomicrographs. That is why AlexNet has a place among the tools that have changed science and the world.

In addition to the projects above, the codes selected for the list include biological databases, atmospheric circulation models, the image processing software NIH Image / ImageJ / Fiji, and BLAST, the sequence alignment search tool for biological macromolecules. Interested readers can consult the original article.
