Cross entropy
The cross entropy between two distributions \(p\) and \(q\) over \(K\) outcomes is
\begin{equation}
H(p, q) = -\sum_{k = 1}^{K} p_k \log q_k.
\end{equation}
This should be interpreted as coding events drawn from \(p\) using the "surprise" of \(q\), i.e. paying \(-\log q_k\) bits for outcome \(k\).
We are encoding \(p\) using the wrong distribution of surprise, \(q\), instead of \(p\) itself. This is why the cross entropy is minimized when \(q = p\), where it reduces to the entropy \(H(p)\). That is the content of Shannon's source coding theorem: no code can use fewer than \(H(p)\) bits per symbol on average.
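The entropy \(H(p)\) is not written out above; for completeness, it is just the cross entropy at \(q = p\):
\begin{equation}
H(p) = H(p, p) = -\sum_{k = 1}^{K} p_k \log p_k,
\end{equation}
the average number of bits per symbol under the best possible code (with logs taken base 2).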
By using the wrong distribution we pay a price: encoding \(p\) costs more bits on average. The exact number of extra bits is precisely the KL divergence.
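To make that last statement precise (the KL divergence is not defined above, so the standard definition is used here), subtract the entropy from the cross entropy:
\begin{equation}
H(p, q) - H(p) = \sum_{k = 1}^{K} p_k \log \frac{p_k}{q_k} = D_{\mathrm{KL}}(p \,\|\, q) \geq 0,
\end{equation}
which vanishes exactly when \(q = p\). As a concrete check with base-2 logarithms: for \(p = (\tfrac{1}{2}, \tfrac{1}{2})\) and \(q = (\tfrac{1}{4}, \tfrac{3}{4})\), \(H(p) = 1\) bit while \(H(p, q) \approx 1.21\) bits, so the mismatch costs about \(0.21\) extra bits per symbol.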