Steganography is the technique of hiding secret data within an ordinary, non-secret file or message in order to avoid detection. In this work, we study the case where the hidden secret data is an image and the non-secret data, or cover signal, is an audio signal. To this end, we use a recently proposed residual architecture that operates on short-time discrete cosine transform (STDCT) audio spectrograms. We evaluate this residual steganography architecture on the Localized Narratives dataset, explore the feasibility of using short-time Fourier transform (STFT) audio spectrograms instead of STDCTs to improve the efficiency of the system, investigate permuting the hidden signal so that corruption of the audio is spread across the revealed image, apply averaged audio windows to improve the quality of the results, and test the system under real-world distortions.