Improving Language Understanding by Generative Pre-Training

Learning (or understanding) is compression: finding the mutual information shared between different concepts and reducing them to a common low-dimensional representation.
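
To make "mutual information between concepts" concrete, here is a minimal sketch that estimates I(X; Y) in bits from paired samples. The function name `mutual_information` and the toy English/French word pairs are illustrative assumptions, not something taken from the paper.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)               # counts of each (x, y) pair
    px = Counter(x for x, _ in pairs)    # marginal counts of x
    py = Counter(y for _, y in pairs)    # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


# Hypothetical toy data: each English word always co-occurs with one French word.
aligned = [("dog", "chien"), ("cat", "chat")] * 50

# Hypothetical toy data: every English word co-occurs equally with every French word.
independent = [(e, f) for e in ("dog", "cat") for f in ("chien", "chat")] * 25

mi_aligned = mutual_information(aligned)          # 1 bit: one variable determines the other
mi_independent = mutual_information(independent)  # 0 bits: nothing is shared
```

When the two variables carry the same information, a single shared code suffices to describe both. That is the sense in which finding mutual information amounts to compression.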

It is often more parameter-efficient to learn one common representation of a semantic concept (same content, different styles) than a separate representation per style -> this is what we want to guide the model to do.
Abstract representations of the same semantics in multiple languages can be “linked” through words the languages have in common, which appear in sentences from each of them.
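
The parameter-efficiency argument can be illustrated with rough arithmetic. The sketch below compares storing one vector per (concept, style) pair against storing one shared vector per concept plus one vector per style. The additive split into concept and style vectors and all the sizes used are assumptions chosen for illustration, not figures from the paper.

```python
def separate_params(num_concepts, num_styles, dim):
    """One independent vector for every (concept, style) combination."""
    return num_concepts * num_styles * dim


def shared_params(num_concepts, num_styles, dim):
    """One shared vector per concept plus one vector per style (assumed design)."""
    return num_concepts * dim + num_styles * dim


# Hypothetical sizes: 10,000 concepts, 5 styles (or languages), 256-dim vectors.
separate = separate_params(10_000, 5, 256)  # 12,800,000 parameters
shared = shared_params(10_000, 5, 256)      # 2,561,280 parameters
```

Under these assumed sizes the shared scheme needs roughly a fifth of the parameters, and the gap widens as more styles are added.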