1. Self-supervised learning is a rebranded term, rather than a new method.
At SOCML 2017, in a small room of people that included Bill Dally and @goodfellow_ian, I raised a question.
“Is word2vec supervised or unsupervised? It is supervised in the sense that we penalize wrong predictions during training, but the corpus is not actually labeled, so it’s unsupervised as well.”
No one in the room answered my question by saying “it’s called self-supervised learning!” (fair enough, LeCun wasn’t there). Self-supervised learning wasn’t a popular term 3 years ago. I believe LeCun used to call it predictive unsupervised learning. In retrospect, word2vec, BERT, XLM, etc. all fall under the umbrella of self-supervised methods, but when they were published, none of the authors advertised them as such.
[For people who aren’t familiar with the term ‘self-supervised learning’]
It’s ‘self’ in the sense that you use your own unlabeled training data for supervision. For example, a language model predicts the next word given the preceding words and compares that prediction with the actual word in the corpus.
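To make that concrete, here is a minimal sketch of the idea, assuming a toy one-sentence corpus and a simple bigram count model with add-one smoothing (both are my own illustrative choices, not how word2vec or BERT are actually trained). The point is only that the (input, target) pairs are cut out of raw text, with no human annotation involved.

```python
import math
from collections import Counter, defaultdict

def make_pairs(tokens):
    """Cut raw text into (previous word, next word) pairs; the text labels itself."""
    return list(zip(tokens[:-1], tokens[1:]))

def fit_bigram(pairs):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for prev, nxt in pairs:
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev, vocab, alpha=1.0):
    """Predicted distribution over the next word, with add-alpha smoothing."""
    total = sum(counts[prev].values()) + alpha * len(vocab)
    return {w: (counts[prev][w] + alpha) / total for w in vocab}

def average_loss(counts, pairs, vocab):
    """Cross-entropy between the predictions and the words that actually came next."""
    nll = [-math.log(next_word_probs(counts, prev, vocab)[nxt]) for prev, nxt in pairs]
    return sum(nll) / len(nll)

# Hypothetical unlabeled corpus: nobody annotated anything here.
corpus = "the cat sat on the mat and the cat slept".split()
pairs = make_pairs(corpus)
vocab = sorted(set(corpus))

model = fit_bigram(pairs)
probs = next_word_probs(model, "the", vocab)
prediction = max(probs, key=probs.get)
loss = average_loss(model, pairs, vocab)

print("most likely word after 'the':", prediction)
print("average next-word loss:", round(loss, 3))
```

The "supervision" is the comparison inside `average_loss`: wrong predictions are penalized exactly as in supervised learning, but the target is just the next word of the same corpus.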
2. The blog admires NLP’s discreteness for making problems tractable (NLP has only a finite number of possible predictions) and for being well suited to predictive architectures.
However, in my experience, discreteness is a double-edged sword. It’s also the very reason NLP problems (especially generation) “don’t work well” or are “hard to control” compared to CV. For example, NLG models tend to output obviously wrong tokens under slight perturbations because the output space is discrete; they are very sensitive. By contrast, small perturbations are barely perceptible in images or audio (continuous signals). Is it really necessary to enumerate all possible candidates during prediction and assign a score to each of them?
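A rough illustration of that sensitivity, assuming a made-up three-token vocabulary and hand-picked scores (none of these numbers come from a real model): every candidate gets a score, a softmax turns the scores into probabilities, and the output is whichever token wins. A nudge of 0.03 to one score is enough to swap the emitted token, while the same nudge to a pixel intensity just gives a marginally different pixel.

```python
import math

def softmax(scores):
    """Turn a score per candidate token into a probability distribution."""
    top = max(scores.values())
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def pick_token(scores):
    """Discrete output: enumerate all candidates and emit the highest-scoring one."""
    probs = softmax(scores)
    return max(probs, key=probs.get)

# Hypothetical scores over a hypothetical vocabulary.
logits = {"cat": 2.00, "car": 1.98, "cap": 0.50}
eps = 0.03  # small perturbation, chosen for illustration

perturbed = dict(logits, car=logits["car"] + eps)

before = pick_token(logits)
after = pick_token(perturbed)

# Continuous output: the same perturbation applied to a pixel in [0, 1].
pixel_before = 0.50
pixel_after = pixel_before + eps

print("token:", before, "->", after)
print("pixel:", pixel_before, "->", round(pixel_after, 2))
```

In the discrete case there is no "slightly wrong" token: the output jumps from one word to an unrelated one, whereas the continuous output only shifts by the size of the perturbation.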
3. This blog was somehow publicized as “thanks to self-supervised learning, we don’t need to label data anymore”. However, the blog contained zero discussion of how one can skip annotations in downstream tasks (which I don’t think is possible at the moment; intelligence != autonomy). We never needed labels for pretraining in the first place. So what’s new here?
I think the meat of this article is LeCun suggesting promising directions in SSL, e.g. non-contrastive energy-based models (EBMs). However, without a paradigm shift, we’ll still be using SSL as the same old pretraining, followed by learning from labels for downstream tasks.
Along these lines, the things I’d love to hear from Yann (or anyone) are:
• different ways to use this approximate form of common sense
• a better formulation of SSL that could bypass supervision in downstream tasks
• how we could use SSL for grounding
• a new formulation for handling multiple tasks, etc.