Cross entropy
The cross entropy between two distributions \(p\) and \(q\) over \(K\) outcomes is
\begin{equation}
H(p, q) = -\sum_{k = 1}^{K} p_k \log q_k.
\end{equation}
This should be interpreted as coding events drawn from \(p\) using the "surprise" of \(q\), i.e. paying \(-\log q_k\) bits for outcome \(k\).
We are encoding \(p\) using the wrong distribution of surprise, \(q\), instead of \(p\) itself. This is why the cross entropy is minimized when \(q = p\), where it reduces to the entropy \(H(p)\). That is the content of Shannon's source coding theorem: no code can use fewer than \(H(p)\) bits per symbol on average.
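The entropy \(H(p)\) is not written out above; for completeness, it is just the cross entropy at \(q = p\):
\begin{equation}
H(p) = H(p, p) = -\sum_{k = 1}^{K} p_k \log p_k,
\end{equation}
the average number of bits per symbol under the best possible code (with logs taken base 2).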
By using the wrong distribution we pay a price: encoding \(p\) costs more bits on average. The exact number of extra bits is precisely the KL divergence.
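To make that last statement precise (the KL divergence is not defined above, so the standard definition is used here), subtract the entropy from the cross entropy:
\begin{equation}
H(p, q) - H(p) = \sum_{k = 1}^{K} p_k \log \frac{p_k}{q_k} = D_{\mathrm{KL}}(p \,\|\, q) \geq 0,
\end{equation}
which vanishes exactly when \(q = p\). As a concrete check with base-2 logarithms: for \(p = (\tfrac{1}{2}, \tfrac{1}{2})\) and \(q = (\tfrac{1}{4}, \tfrac{3}{4})\), \(H(p) = 1\) bit while \(H(p, q) \approx 1.21\) bits, so the mismatch costs about \(0.21\) extra bits per symbol.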