The Evolution of Deep Learning: 2019's Breakthroughs in NLP, AutoML, and AI Applications

Introduction

In a comprehensive lecture at MIT, Lex Fridman offered a thoughtful exploration of the state of deep learning as it stood in early 2019. Rather than simply cataloging benchmark results, Fridman's presentation focused on the fundamental ideas and developments that were defining the cutting edge of this rapidly evolving field.

"Here we stand in 2019 really at the height of some of the great accomplishments that have happened. But also stand at the beginning. And it's up to us to define where this incredible data-driven technology takes us," Fridman explained, setting the stage for a wide-ranging discussion that covered theoretical breakthroughs, practical applications, and community developments.

This talk came at a pivotal moment in AI history—following years of remarkable progress in computer vision, just as natural language processing was experiencing its own revolutionary moment, and as deep reinforcement learning was achieving feats previously thought impossible. While not covering every development (medical applications and protein folding, for instance, were not addressed), Fridman's lecture provided a valuable snapshot of where deep learning stood and where it might be heading.

The NLP Revolution: 2018 as the "ImageNet Moment" for Language

According to Fridman, 2018 will be remembered as the breakthrough year for Natural Language Processing (NLP). He described it as the "ImageNet moment" for the field, drawing parallels to how AlexNet transformed computer vision in 2012 by demonstrating the power of purely learning-based methods.

The Path to BERT

The journey to this NLP breakthrough followed several critical developments:

  1. Encoder-Decoder Architecture: This fundamental approach takes a sequence of words as input, encodes it into a fixed vector representation using recurrent units (like LSTMs or GRUs), and then decodes that representation into an output sequence that can have a different length—essential for tasks like machine translation.
  2. Attention Mechanism: Rather than collapsing an entire input sequence into a single vector, attention mechanisms allow the decoder to selectively focus on different parts of the input sequence during generation. As Fridman explains: "It provides a mechanism that allows to look back at the input sequence... You're allowed to look back at the particular samples from the input sequence as part of the decoding process."
  3. Self-Attention: This innovation expanded the attention concept by allowing the encoder to look at other parts of the input sequence when forming representations. "It allows you to determine for certain words what are the important relevant aspects of the input sequence that can help you encode that word the best," Fridman noted. (See the sketch after this list.)
  4. Transformer Architecture: Building on self-attention, the transformer uses self-attention in both the encoder and decoder, capturing rich contextual information without recurrence.
  5. Word Embeddings Evolution: Traditional Word2Vec embeddings were enhanced by ELMo's bi-directional LSTMs, which looked at context in both directions around a word to create richer representations.
  6. OpenAI Transformer: This approach leveraged the transformer's language modeling capabilities, allowing for transfer to specific tasks.
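To make the attention mechanism concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy, the operation at the heart of the transformer encoder. The dimensions and weight matrices are illustrative placeholders; a real transformer uses multiple learned heads per layer, masking, and positional information.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project each token vector into a query, key, and value.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token scores every other token (scaled dot products).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over input positions turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a context-weighted mix of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                          # e.g. 5 tokens, 16-dim embeddings
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
context = self_attention(X, Wq, Wk, Wv)           # (5, 16) contextualized token vectors
```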

BERT: The Breakthrough

BERT (Bidirectional Encoder Representations from Transformers) marked a significant leap forward in NLP. What made it different was its "richly bi-directional" approach:

"With BERT, it's richly bi-directional—it takes in the full sequence of the sentence and masks out some percentage of the words, 15% of the words, 15% of the samples of tokens from the sequence. And tasks the entire encoding self-attention mechanism to predict the words that are missing," Fridman explained.

By stacking multiple encoders and self-attention layers, BERT learned rich contextual representations that could be fine-tuned for numerous language tasks—from classification and similarity detection to question answering and sentence tagging. This versatility and performance improvement represented the most significant NLP breakthrough of the year.
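As a rough illustration of that masking objective, the sketch below hides about 15% of the tokens in a sentence; a BERT-style model is then trained to predict the hidden tokens from the surrounding context. The helper function and example sentence are invented for illustration, and the actual BERT recipe is slightly more involved (a fraction of the selected tokens are replaced with random or unchanged tokens rather than "[MASK]").

```python
import random

def mask_tokens(tokens, mask_rate=0.15):
    """Toy version of BERT's masked-token objective: hide roughly 15% of tokens."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)      # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)     # nothing to predict at this position
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
# e.g. masked = ['the', 'quick', '[MASK]', 'fox', ...] with targets marking the hidden words
```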

Neural Networks in the Real World: Tesla Autopilot

Stepping away from theoretical advances, Fridman highlighted one of the most significant practical applications of deep learning: Tesla's Autopilot system.

"Tesla has a system called Autopilot where the hardware version 2 of that system is a newer implementation of the NVIDIA Drive PX 2 system which runs a ton of neural networks," Fridman explained. The system processes input from eight cameras at various resolutions through a variant of the Inception network architecture, performing critical perception tasks like drivable area segmentation and object detection.

What makes this application particularly significant is that these neural networks are operating "in the wild"—controlling vehicles driven by everyday consumers who often have no understanding of the technology's capabilities or limitations.

"Now it has a neural network... controlling the life of a human being. And that to me is one of the great breakthroughs of 17 and 18 in terms of the development of what AI can do in a practical sense in impacting the world," Fridman emphasized.

By early 2019, vehicles had accumulated over 1 billion miles using Autopilot, with approximately half of those miles driven using the neural network-based Hardware Version 2 system—a system that continually learns and improves through regular updates.

Automating Machine Learning: AutoML Developments

The dream of AutoML—automating the design and optimization of machine learning systems—saw significant advances in 2017-2018. Fridman highlighted several key developments:

AdaNet: AutoML with Ensembles

AdaNet represented an evolution of neural architecture search methods, using reinforcement learning to build ensembles of neural networks rather than single architectures:

"What is doing here is given candidate architectures, stitching them together to form an ensemble to get state-of-the-art performance," Fridman explained. While the performance improvements weren't revolutionary, they represented another step toward fully automated machine learning pipelines.

AutoAugment: Deep RL for Data Augmentation

Fridman noted that while architecture development has seen tremendous innovation since 2012, data augmentation techniques have been relatively neglected—a gap that AutoAugment began to address:

"AutoAugment is just a step, a tiny step into that direction that I hope that we as a community invest a lot of effort in," Fridman remarked. The system uses reinforcement learning to discover optimal data augmentation policies—combinations of basic transformations like translation, rotation, and color manipulation—that maximize model performance.

One of the most interesting findings was that augmentation policies learned on one dataset (like ImageNet) could be transferred to other datasets, similar to how neural network weights can be transferred. This represented a new dimension of transfer learning focused on data processing rather than model parameters.

"You can transfer as part of the transfer learning process, take the data augmentation policies learned on ImageNet, and transfer those. You can transfer both the weights and the policies," Fridman explained, calling it "a really super exciting idea."

Learning from Synthetic Data

Another area seeing innovation was the use of synthetic data for training deep neural networks. Fridman highlighted NVIDIA's work in this space, where they created highly diverse synthetic datasets by systematically varying backgrounds, lighting conditions, object positions, and more—often creating scenes that "can't possibly happen in reality."

"NVIDIA is really good at creating realistic scenes. And they said, 'okay, let's create realistic scenes but let's also go away aboveboard and not do realistic at all. Do things that can't possibly happen in reality,'" Fridman noted.

The researchers showed that networks trained on these synthetic datasets could achieve strong performance, especially when fine-tuned with a small amount of real data. This approach offers another path to "learn a lot from a little" by generating rich, diverse training data synthetically.

Making Annotation Easier: Polygon-RNN++

For supervised learning tasks like segmentation that require pixel-level annotation, creating training datasets can be extraordinarily labor-intensive. The Polygon-RNN++ project addressed this challenge by using neural networks to assist in the annotation process.

After a user draws a bounding box around an object, the system uses convolutional neural networks to place the first point of a polygon, then employs recurrent neural networks to draw the rest of the polygon around the object. This dramatically reduces the time and effort required to create segmentation masks.

"The dream with AutoML is to remove the human from the picture as much as possible," Fridman explained. "With data augmentation, remove the human from the picture as much as possible for menial data. Automate the boring stuff, and in this case, the act of drawing a polygon—try to automate it as much as possible."

Making Deep Learning Fast and Affordable: DAWNBench

An important trend Fridman highlighted was the push to make deep learning more accessible through faster, more cost-efficient training methods. Stanford's DAWNBench benchmark challenged researchers to achieve specific accuracy thresholds (93% on ImageNet, 94% on CIFAR-10) with minimal training time and cost.

The fast.ai team, described by Fridman as "a renegade awesome group of deep learning researchers," achieved remarkable results—training ImageNet to 93% accuracy in just 3 hours for $25, and CIFAR-10 to 94% accuracy for only 26 cents.

Their key innovation was a simple but effective learning rate manipulation technique: "They found that if they crank up the learning rate while decreasing the momentum, which is a parameter of the optimization process, and they do it jointly, they're able to make the network learn really fast."
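That schedule is close in spirit to the one-cycle policy popularized by fast.ai: the learning rate ramps up while momentum comes down, then both reverse. Below is a minimal PyTorch sketch using the built-in OneCycleLR scheduler; the model, data, and hyperparameters are placeholders, not fast.ai's actual DAWNBench configuration.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1.0,           # peak learning rate
    total_steps=1000,     # optimizer steps in the whole run
    cycle_momentum=True,  # momentum moves inversely to the learning rate
    base_momentum=0.85,
    max_momentum=0.95,
)

for step in range(1000):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))  # fake batch
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()      # update learning rate and momentum every batch
```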

This development has profound implications for democratizing AI research: "That's exactly for people sitting in this room that opens up the door to doing all kinds of fundamental deep learning problems without the resources of Google, DeepMind, or OpenAI or Facebook... That's important for academia, that's important for independent researchers."

Advances in Generative Models

BigGAN: Scaling Image Synthesis

For generative adversarial networks (GANs), 2018 was more about scaling existing techniques than developing new architectural innovations. Google DeepMind's BigGAN focused on increasing model capacity and batch size, producing stunningly realistic high-resolution images.

"It produces incredible images. I encourage you to go online and look at them. It's hard to believe that they're generated," Fridman remarked. "So 2018 for GANs was a year of scaling and parameter tuning as opposed to breakthrough new ideas."

Video-to-Video Synthesis

NVIDIA made significant advances in video generation with their video-to-video synthesis approach. The key innovation was making temporal consistency part of the optimization process:

"The idea with video-to-video synthesis... is to make the temporal consistency, the temporal dynamics part of the optimization process. So make it look not jumpy," Fridman explained.

The system could transform various inputs—like semantic segmentation maps, edge drawings, or body poses—into realistic, temporally consistent video. Unlike previous image-to-image approaches that created noticeable frame-to-frame jitter when extended to video, NVIDIA's approach maintained coherence over time.
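One generic way to express "make it look not jumpy" as a training signal is a temporal consistency loss: warp the previously generated frame along the optical flow and penalize differences from the current generated frame. The PyTorch sketch below is an assumed, simplified formulation of that idea, not NVIDIA's exact objective; it assumes `flow` is already given in the normalized coordinates that `grid_sample` expects.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(prev_frame, curr_frame, flow):
    """Penalize flicker: warp the previous frame along the flow and compare.

    prev_frame, curr_frame: (n, c, h, w) generated frames
    flow: (n, h, w, 2) displacement in normalized [-1, 1] coordinates (assumed given)
    """
    n, _, h, w = prev_frame.shape
    # Identity sampling grid in [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Shift the grid by the flow and sample the previous frame at those locations.
    warped = F.grid_sample(prev_frame, grid + flow, align_corners=True)
    return F.l1_loss(warped, curr_frame)
```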

Semantic Segmentation: DeepLabv3+

In the realm of image understanding, semantic segmentation—assigning a class label to every pixel in an image—represents one of the most complex perception tasks. Fridman highlighted the evolution of segmentation architectures from early fully convolutional networks to the state-of-the-art DeepLabv3+.

The key innovation in DeepLabv3+ was its multi-scale processing approach using "atrous" (dilated) convolutions: "Without increasing the parameters, the multi-scale is achieved by the 'atrous rate.' So taking those atrous convolutions and increasing the spacing... You can consider all these different scales of processing and looking at the layers of features."

By varying the dilation rate, the model could effectively capture both fine details and broader context without increasing computational complexity, leading to state-of-the-art results on benchmarks like PASCAL VOC.
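A small PyTorch sketch of that idea: parallel 3x3 convolutions with different dilation (atrous) rates see progressively larger contexts while keeping the parameter count of each branch fixed. The module below is a toy version in the spirit of DeepLab's atrous spatial pyramid pooling, with made-up channel counts and rates.

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Toy atrous spatial pyramid pooling block: the same 3x3 kernels with
    different dilation rates, so each branch covers a different spatial scale."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # one scale per branch
        return self.project(torch.cat(feats, dim=1))      # fuse the scales

features = torch.randn(1, 256, 32, 32)       # e.g. a backbone feature map
out = MiniASPP(256, 64)(features)            # -> (1, 64, 32, 32)
```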

Deep Reinforcement Learning: From Games to Real-World Applications

Some of the most impressive AI achievements in recent years have come from deep reinforcement learning, starting with DeepMind's DQN paper showing superhuman performance on Atari games using only raw pixel inputs.

AlphaZero: Learning Through Self-Play

Building on the success of AlphaGo, which combined expert supervision with self-play to defeat the world champion in Go, DeepMind developed AlphaGo Zero and later AlphaZero—systems that learned entirely through self-play, without human examples.

"Alpha Zero will be a thing that people will remember as an interesting moment in time, as a key moment in time," Fridman predicted. With just four hours of training (albeit on highly distributed systems), AlphaZero learned to beat Stockfish, the leading chess engine, as well as the top Shogi engine, Elmo.

What made AlphaZero particularly fascinating was how it resembled human cognition more than traditional game-playing algorithms:

"If you look at the way human grandmasters think, it certainly doesn't feel like they're looking down a tree. There's something like creative intuition, there's something like you can see the patterns in the board... And Alpha Zero is moving closer and closer towards the human grandmaster, considering very few future moves," Fridman explained.

While traditional chess engines like Stockfish analyze billions of potential future positions, AlphaZero relied more on neural network evaluation of board quality, examining orders of magnitude fewer positions—more like a human expert would.

OpenAI Five: Tackling Team-Based Games

On the messier end of the gaming spectrum, OpenAI took on the challenge of Dota 2, a complex team-based game that requires coordination, long-term planning, and dealing with imperfect information.

After achieving a milestone in 2017 by beating a top professional player in 1v1 matches, OpenAI attempted to tackle the full 5v5 game in 2018. Though they lost two games against professional teams at The International 2018 (Dota 2's premier tournament), the attempt itself pushed the boundaries of what AI systems could handle.

"Dota 2 and this particular video game makes it currently... really two games that have the public eye in terms of AI taking on as benchmarks," Fridman noted, with the other being poker—specifically team Texas No-Limit Hold'em.

Maturation of Deep Learning Frameworks

Fridman highlighted how 2018 marked a maturation of deep learning frameworks and ecosystems, with PyTorch reaching its 1.0 release and TensorFlow evolving toward 2.0, standardizing deep learning practices.

"Those two players have made incredible leaps in standardizing deep learning. In the fact that a lot of the ideas I talked about today and Monday and we'll keep talking about all have a GitHub repository with implementations in TensorFlow and PyTorch, making [them] extremely accessible, and that's really exciting," he noted.

Looking Forward: Reimagining Deep Learning

To conclude, Fridman quoted Geoffrey Hinton, often called the "Godfather of deep learning," who recently suggested that the field might need to "throw it all away and start again." Hinton believes that backpropagation—the fundamental algorithm underlying most deep learning achievements—is "totally broken" and needs to be completely revolutionized.

Hinton suggested that "the future depends on some graduate student who's deeply suspicious of everything I've said," a perspective that Fridman found appropriate for closing a discussion on the state of the art in deep learning.

"Everything we're doing is fundamentally based on ideas from the 60s and the 80s, and really in terms of new ideas, there has not been many new ideas. Especially the state-of-the-art results that I've mentioned are all based fundamentally on stochastic gradient descent and backpropagation," Fridman observed. "It's ripe for totally new ideas. So it's up to us to define the real breakthroughs and the real state of the art 2019 and beyond."

Key Points

  1. 2018 marked the "ImageNet moment" for NLP, with BERT's bidirectional approach to language modeling delivering breakthrough performance across numerous language tasks.
  2. Neural networks entered everyday life through systems like Tesla Autopilot, where they directly impact human safety and have accumulated over a billion miles of real-world experience.
  3. AutoML expanded beyond architecture search to include automated data augmentation strategies and ensemble construction, moving toward fully automated machine learning pipelines.
  4. Learning efficiency improved through innovations in synthetic data generation, annotation assistance tools, and training optimization techniques that made deep learning more affordable and accessible.
  5. Reinforcement learning systems demonstrated increasingly human-like reasoning in games like chess and Go, while beginning to tackle more complex team-based games requiring coordination and long-term planning.
  6. Deep learning frameworks matured and standardized, creating robust ecosystems that make cutting-edge techniques accessible to researchers and practitioners worldwide.
  7. Despite impressive progress, deep learning remains fundamentally reliant on decades-old algorithms like backpropagation and stochastic gradient descent, suggesting potential for revolutionary new approaches in the future.
