Attention mechanisms have been used successfully in many Computer Vision and Natural Language Processing tasks, but have not yet been applied to 3D reconstruction problems. In this work, we explore the potential of attention in a multi-view 3D face reconstruction pipeline. On the one hand, we use spatial attention when extracting features from the input images, taking advantage of the interpretability it provides, which allows us to validate that the model behaves properly. On the other hand, we want to make this multi-view setup invariant to the order of the input views. To do so, instead of concatenating the features of the different views, we use part of the Transformer architecture, which is based on a multi-head self-attention mechanism, as a symmetric merging function, and we show that this improves performance.
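
The order-invariance argument can be illustrated with a minimal sketch, which is not the implementation used in this work. It assumes that each view has already been encoded as a feature vector, that no positional encoding is added, and that the attended features are averaged over views. The averaging step, the dimensions, the random weights, and all function names (`merge_views`, `self_attention_head`, and so on) are hypothetical choices made for illustration. The residual connections, layer normalisation, and feed-forward sublayer of a full Transformer block are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_head(x, wq, wk, wv):
    """Scaled dot-product self-attention over the rows (views) of x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def merge_views(view_features, heads, wo):
    """Merge per-view features into one vector, independent of view order.

    Self-attention without positional encoding is permutation-equivariant;
    averaging over views afterwards (an assumption of this sketch) makes
    the result permutation-invariant.
    """
    head_outputs = [self_attention_head(view_features, *h) for h in heads]
    # Concatenate the outputs of all heads along the feature axis.
    concat = [sum((ho[i] for ho in head_outputs), []) for i in range(len(view_features))]
    attended = matmul(concat, wo)
    n_views = len(attended)
    return [sum(col) / n_views for col in zip(*attended)]


def random_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 3 views, 8-dimensional features, 2 heads of size 4.
rng = random.Random(0)
d_model, n_heads, d_head = 8, 2, 4
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(n_heads)]
wo = random_matrix(n_heads * d_head, d_model, rng)
views = random_matrix(3, d_model, rng)

merged = merge_views(views, heads, wo)
merged_shuffled = merge_views([views[2], views[0], views[1]], heads, wo)

# The merged vector is the same (up to rounding) for any ordering of the views.
same = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(merged, merged_shuffled))
print(same)
```

By contrast, concatenating the per-view features yields a different vector whenever the views are reordered, so the downstream layers would have to learn each ordering separately.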