When there are both audio and video tracks, fragment
creation is driven by the video track (a combination of the
duration so far and the next key frame). If there is no video
track, the duration so far alone drives fragment creation.
Due to a bug, when there was only an audio track, only the first
fragment was created as expected; after that, a new fragment was
created for every audio sample.
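
The intended rule can be sketched roughly as below. This is a
minimal illustration, not the muxer's actual code: the class,
method, and parameter names are hypothetical, and treating
"duration so far + next key frame" as a logical AND against a
target fragment duration is an assumption.

```java
/** Illustrative sketch only; all names here are hypothetical. */
public final class FragmentBoundarySketch {
  private FragmentBoundarySketch() {}

  /** Returns whether a new fragment should start before the next sample. */
  static boolean shouldStartNewFragment(
      boolean hasVideoTrack,
      boolean nextVideoSampleIsKeyFrame,
      long fragmentDurationSoFarUs,
      long targetFragmentDurationUs) {
    boolean durationReached = fragmentDurationSoFarUs >= targetFragmentDurationUs;
    if (hasVideoTrack) {
      // Assumption: with video, wait for a key frame once the duration is reached.
      return durationReached && nextVideoSampleIsKeyFrame;
    }
    // Audio only: the accumulated duration alone decides.
    return durationReached;
  }

  /** Counts fragments for an audio-only stream of equal-length samples. */
  static int countAudioOnlyFragments(
      int sampleCount, long sampleDurationUs, long targetFragmentDurationUs) {
    if (sampleCount == 0) {
      return 0;
    }
    int fragments = 1;
    long durationSoFarUs = 0;
    for (int i = 0; i < sampleCount; i++) {
      if (shouldStartNewFragment(false, false, durationSoFarUs, targetFragmentDurationUs)) {
        fragments++;
        // The duration is tracked per fragment, so it restarts here.
        durationSoFarUs = 0;
      }
      durationSoFarUs += sampleDurationUs;
    }
    return fragments;
  }
}
```

With ten audio samples of 100 us each and a 500 us target, this
sketch yields two fragments rather than one per sample.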
#cherrypick
PiperOrigin-RevId: 731257696