Improving language understanding by generative pre-training
Learning (or understanding) is compression: finding the mutual information between different concepts and reducing them to a common low-dimensional representation.
- It is often more parameter-efficient to learn a single shared representation of a semantic concept (same content, different styles). This is what we want to guide the model to do.
- Abstract representations of the same semantics in multiple languages can be “linked” through words shared across multiple sentences.
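
A rough back-of-the-envelope sketch of the parameter-efficiency point, not anything taken from the paper. It assumes a toy setup in which each (concept, style) pair either gets its own vector, or is composed from one shared concept vector plus one style vector. The function names, the additive factorization, and the sizes are all hypothetical and chosen only for illustration.

```python
def separate_param_count(num_concepts: int, num_styles: int, dim: int) -> int:
    """Parameters needed when every (concept, style) pair has its own vector."""
    return num_concepts * num_styles * dim


def shared_param_count(num_concepts: int, num_styles: int, dim: int) -> int:
    """Parameters needed with one vector per concept plus one vector per style."""
    return num_concepts * dim + num_styles * dim


# Hypothetical sizes (assumptions, not figures from the paper).
NUM_CONCEPTS, NUM_STYLES, DIM = 10_000, 4, 256

separate = separate_param_count(NUM_CONCEPTS, NUM_STYLES, DIM)
shared = shared_param_count(NUM_CONCEPTS, NUM_STYLES, DIM)

print(f"separate: {separate:,} parameters")
print(f"shared:   {shared:,} parameters")
print(f"ratio:    {separate / shared:.2f}x")
```

Under these assumptions the separate scheme grows with the product of concepts and styles, while the shared scheme grows with their sum, which is the sense in which one common representation is the cheaper thing for the model to learn.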