Meta bets on self-supervised learning for human-level AI

Yann LeCun, Meta’s chief AI scientist, keeps his long-term aim in mind even when discussing near-term steps. “We want intelligent devices that learn like animals and people,” LeCun told IEEE Spectrum. This is the story of the push for self-supervised learning in AI, how close we are to it, and how much we should worry.

Meta, formerly Facebook, has published a series of papers on self-supervised learning (SSL) for AI systems. SSL contrasts with supervised learning, where an AI system learns from labeled data (the labels serve as the teacher who provides the correct answers when the system checks its work). LeCun believes SSL is a precondition for AI systems that can develop “world models” and gain humanlike faculties such as reason, common sense, and the ability to transfer skills and information. The new papers describe how a masked auto-encoder (MAE) was trained to reconstruct images, video, and audio from incomplete input. MAEs aren’t new, but Meta has expanded their use.

LeCun thinks the MAE system must be building a world model by predicting missing image, video, or audio data. “If it can anticipate what will happen in a movie, it must comprehend that the environment is three-dimensional, that certain items are lifeless and don’t move, that other objects are animate and harder to forecast, all the way up to anticipating complex behavior from animate persons,” he says. Once AI has a good world model, it can plan actions.

“Intelligence is learning to predict,” LeCun says. While he doesn’t claim that Meta’s MAE system is anything close to human-level AI, he sees it as a step forward.

Not everyone thinks Meta’s researchers are headed in the right direction. Yoshua Bengio helped pioneer deep neural networks alongside LeCun and Geoffrey Hinton, and he and LeCun sometimes disagree over big ideas in AI. Bengio explained their differences and similarities to IEEE Spectrum.

Bengio writes that present methodologies (self-supervised or not) aren’t enough to reach human-level intelligence. He adds that “qualitative advancements” are needed to get human-scale AI.

Bengio agrees with LeCun that the ability to reason about the world is fundamental to intelligence, but his team focuses on models that can represent knowledge in natural language. Such a model “would let us combine information to address new problems, undertake counterfactual simulations, or investigate possible futures,” he says. Bengio’s team built a modular neural-net framework, unlike LeCun’s end-to-end learning architecture (models that learn all the steps between the initial input stage and the final output result).


Meta’s MAE builds on the transformer neural network architecture. Transformers were first used in natural-language processing, where they underpin models such as Google’s BERT and OpenAI’s GPT-3. Meta AI researcher Ross Girshick says computer-vision experts “worked hard” to replicate the success transformers had with language.

Meta’s researchers weren’t the first to apply transformers to visual tasks; Girshick says Google’s research on the Vision Transformer (ViT) motivated them. “By adopting the ViT architecture, it removed hurdles to experimenting,” he tells Spectrum.

Girshick coauthored Meta’s first static-image MAE work. Its training was similar to BERT’s. Such language models are fed vast datasets of text with some words “masked.” After the models anticipate the missing words, the missing text is unmasked so they may examine their work, tweak their parameters, and attempt again. Girshick adds that the team divided up photos into patches, masked some of the patches, and asked the MAE algorithm to forecast the missing bits.
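The patch-and-mask step Girshick describes can be sketched in a few lines of NumPy. This is a minimal illustration, not Meta’s code: the function names `patchify` and `random_mask`, the 224×224 image size, the 16-pixel patch size, and the mask ratio passed in are all assumptions chosen for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))           # stand-in for a real photo
patches = patchify(image, patch_size=16)    # hypothetical patch size
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[~mask]                    # what the model gets to see
print(patches.shape, visible.shape, int(mask.sum()))
```

The model would be shown only `visible` and asked to predict the patches where `mask` is True.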

One of the team’s discoveries was that masking a large part of the image gave the best results, unlike language transformers, which typically mask about 15 percent of the words. “Language is a dense, efficient communication system,” Girshick says. “Every sign has meaning. Images, which are natural messages, aren’t designed to eliminate redundancy.” He says that’s why JPG photos compress so nicely.

By masking 75% of an image’s patches, Girshick says, they remove redundancy that would otherwise make the training task too easy. Their two-part MAE system uses an encoder to learn relationships between pixels across the training data set, then a decoder to reconstruct the original images from the masked versions. After training, the decoder is discarded and the encoder is fine-tuned for tasks such as classification and object detection.
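To make the reconstruction objective concrete, here is a small sketch of how a prediction could be scored. It assumes the error is measured only on the hidden patches, which is a common choice for masked autoencoders but is not stated in the article; the function name `masked_mse` and the “mean of visible patches” baseline are illustrative stand-ins, not the real decoder.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only.

    pred, target: (num_patches, patch_dim) arrays.
    mask: (num_patches,) boolean array, True = patch was hidden.
    """
    diff = (pred[mask] - target[mask]) ** 2
    return float(diff.mean())

rng = np.random.default_rng(0)
num_patches, patch_dim = 196, 768          # hypothetical sizes
target = rng.random((num_patches, patch_dim))

mask = np.zeros(num_patches, dtype=bool)
mask[rng.permutation(num_patches)[:147]] = True   # hide 75% of 196 patches

# Stand-in "prediction": fill every patch with the mean visible patch.
baseline = np.tile(target[~mask].mean(axis=0), (num_patches, 1))
perfect = target.copy()

baseline_loss = masked_mse(baseline, target, mask)
perfect_loss = masked_mse(perfect, target, mask)
print(baseline_loss, perfect_loss)
```

A trained decoder would aim to land well below the naive baseline; a perfect reconstruction scores zero.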

Girshick says they’re interested in transfer learning to downstream tasks. “We’re seeing extremely large gains” when using the encoder for object recognition, he says. Scaling up the model improved performance, which is promising for future models because SSL “can use a lot of data without manual annotation.”

Training on enormous uncurated data sets may be Meta’s strategy for strengthening SSL, but the approach is contentious: AI ethics researchers such as Timnit Gebru have pointed out the biases embedded in the uncurated data sets that large language models learn from, sometimes with harmful results.

Video and audio self-learning

In the MAE video system, the researchers masked up to 95% of each frame, since video signals have even more redundancy than static images. Christoph Feichtenhofer, a Meta researcher, notes that video is computationally expensive to process, which makes the MAE approach especially useful. By masking up to 95% of each frame, MAE cuts the computing cost by up to 95%, he explains.
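The arithmetic behind that saving is easy to check. The sketch below assumes that encoder cost roughly tracks the number of patches (tokens) it actually processes; the clip dimensions and the function name `visible_tokens` are hypothetical values chosen for illustration.

```python
def visible_tokens(num_tokens: int, mask_ratio: float) -> int:
    """Number of tokens left for the encoder after masking."""
    num_masked = int(round(num_tokens * mask_ratio))
    return num_tokens - num_masked

# Hypothetical sizes: a 14 x 14 patch grid per image, 8 time steps per clip.
image_tokens = 14 * 14
video_tokens = 8 * 14 * 14

image_visible = visible_tokens(image_tokens, 0.75)
video_visible = visible_tokens(video_tokens, 0.95)

print(f"image: {image_visible} of {image_tokens} tokens reach the encoder")
print(f"video: {video_visible} of {video_tokens} tokens reach the encoder")
print(f"video fraction kept: {video_visible / video_tokens:.3f}")
```

In a standard transformer, self-attention cost grows with the square of the token count, so the saving in those layers can be larger than the token fraction alone suggests.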

These experiments used short video clips, but Feichtenhofer says training AI on longer videos is “current research.” Imagine a virtual assistant with a camera feed of your house that can tell you where you left your keys. (Whether you find that fantastic or disturbing, such a system is unlikely to arrive soon.)

Both the image and video systems might be used for content moderation on Facebook and Instagram. “Integrity” is one prospective application, says Feichtenhofer. “We’re talking to product teams, but it’s new and we don’t have any actual initiatives.”

The Meta AI team devised a creative way to apply masking to audio in its audio MAE work, which will shortly be posted on the arXiv preprint server. They transformed sound recordings into spectrograms, visual representations of the frequencies in a signal, and masked parts of those images for training. The reconstructed audio is remarkable, but the model can only handle short clips.
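The spectrogram trick can be illustrated with plain NumPy. This is only a sketch under stated assumptions: the frame length, hop size, patch size, and 75% mask ratio are hypothetical (the article gives no audio settings), and the helper names `spectrogram` and `mask_patches` are made up for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a simple short-time Fourier transform."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, time_frames)

def mask_patches(spec, patch, mask_ratio, rng):
    """Zero out a random subset of square time-frequency patches."""
    f = (spec.shape[0] // patch) * patch            # crop to whole patches
    t = (spec.shape[1] // patch) * patch
    cropped = spec[:f, :t].copy()
    gf, gt = f // patch, t // patch
    num_masked = int(round(gf * gt * mask_ratio))
    flat = np.zeros(gf * gt, dtype=bool)
    flat[rng.permutation(gf * gt)[:num_masked]] = True
    grid = flat.reshape(gf, gt)
    pixel_mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
    cropped[pixel_mask] = 0.0
    return cropped, grid

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate            # one second of audio
signal = np.sin(2 * np.pi * 440 * t)                # a 440 Hz test tone

spec = spectrogram(signal)
rng = np.random.default_rng(0)
masked_spec, grid = mask_patches(spec, patch=16, mask_ratio=0.75, rng=rng)
print(spec.shape, masked_spec.shape, int(grid.sum()))
```

Once the audio is in this image-like form, the same patch masking used for photos applies directly.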

Bernie Huang, who worked on the audio system, says potential uses include classification tasks, helping with voice over IP calls by filling in lost audio, and compressing audio files more efficiently.

Meta has been on a self-supervised learning charm offensive, open-sourcing its MAE models and a pretrained large language model for research. Critics note, however, that Meta’s core business algorithms, the ones that govern newsfeeds, recommendations, and ad placements, are not available for scrutiny.
