How is the softmax computed since there are seq_length x seq_length values per batch element?
The softmax is performed w.r.t. the last axis (`torch.nn.Softmax(dim=-1)(tensor)`, where `tensor` has shape `batch_size x seq_length x seq_length`) to get, for each element of the input sequence, the probability of attending to every element.
Let's assume we have the text sequence "Thinking Machines", so after computing QK^T we have a matrix of shape 2 x 2 (where seq_length = 2).
I am using the following illustration (reference) to explain the self-attention computation. As you know, the scaled dot product QK^T / sqrt(d_k) is computed first, and then the softmax is taken for each sequence element.
Here, the softmax is performed for the first sequence element, "Thinking". The raw scores of 14 and 12 are turned into probabilities of 0.88 and 0.12 by the softmax. These probabilities indicate that the token "Thinking" attends to itself with 88% probability and to the token "Machines" with 12% probability. Similarly, the attention probabilities are computed for the token "Machines" too.
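To make this concrete, here is a minimal sketch of the row-wise softmax using NumPy. The first row uses the raw scores 14 and 12 from the illustration; the second row (scores for "Machines") is hypothetical, just to show that each row is normalized independently:

```python
import numpy as np

# Raw attention scores after QK^T / sqrt(d_k), shape seq_length x seq_length.
# Row 0: scores for "Thinking" (14 and 12, from the illustration).
# Row 1: hypothetical scores for "Machines".
scores = np.array([[14.0, 12.0],
                   [11.0, 13.0]])

# Softmax over the last axis: one probability distribution per query token.
# Subtracting the row max first is the standard trick for numerical stability.
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)

print(probs.round(2))  # first row is approximately [0.88, 0.12]
```

Each row of `probs` sums to 1, which is exactly what `torch.nn.Softmax(dim=-1)` produces on the `seq_length x seq_length` score matrix (per batch element).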

Note: I strongly suggest reading this excellent article on the Transformer. For an implementation, you can take a look at OpenNMT.