For the fusion layer, we explored two straightforward approaches, depicted in Figures 2 and 3. Figure 2 gives an overview of framewise addition-based fusion, which exploits the linear relationship between the lengths of the two representations: the longer representation is subsampled so that both have equal length, and the two are then added frame by frame. Figure 3, on the other hand, shows fusion via cross-attention. This approach does not depend on the lengths of the representations and can merge representations of any size. Further details on both approaches can be found in our paper.
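As a rough illustration of the two mechanisms, the sketch below implements both in NumPy under simplifying assumptions not taken from the paper: the longer sequence's length is assumed to be an integer multiple of the shorter one's, subsampling is plain striding, and the cross-attention is single-head with identity projections. The actual fusion layers (learned projections, number of heads, subsampling scheme) may differ.

```python
import numpy as np

def framewise_add_fusion(a, b):
    """Framewise addition-based fusion (sketch).

    Assumes the lengths of `a` and `b` are linearly related, i.e. the
    longer length is an integer multiple of the shorter one. The longer
    sequence is subsampled by striding so both have equal length, then
    the two are added frame by frame.
    """
    if a.shape[0] < b.shape[0]:
        a, b = b, a
    stride = a.shape[0] // b.shape[0]
    a_sub = a[::stride][: b.shape[0]]  # subsample longer sequence
    return a_sub + b                   # (T_short, d)

def cross_attention_fusion(q_seq, kv_seq):
    """Cross-attention-based fusion (sketch, single head).

    `q_seq` attends over `kv_seq`; the two may have any lengths, and the
    output length equals len(q_seq). Learned query/key/value projections
    are omitted for brevity.
    """
    d = q_seq.shape[-1]
    scores = q_seq @ kv_seq.T / np.sqrt(d)           # (Tq, Tkv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over kv frames
    return weights @ kv_seq                          # (Tq, d)
```

For example, fusing an 8-frame with a 4-frame representation by framewise addition yields 4 frames, while cross-attention with the 8-frame sequence as queries yields 8 frames regardless of the other sequence's length.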