The ability to capture depth information form an scene has greatly increased in the recent years. 3D sensors, traditionally high cost and low resolution sensors, are being democratized and 3D scans of indoor and outdoor scenes are becoming more and more common.

However, there is still a great data gap between the amount of captures being per- formed with 2D and 3D sensors. Although the 3D sensors provide more information about the scene, 2D sensors are still more accessible and widely used. This trade-off between availability and information between sensors brings us to a multimodal scenario of mixed 2D and 3D data.

This thesis explores the fundamental block of this multimodal scenario: the reg- istration between a single 2D image and a single unorganized point cloud. An unorganized 3D point cloud is the basic representation of a 3D capture. In this representation the surveyed points are represented only by their real word coordi- nates and, optionally, by their colour information. This simplistic representation brings multiple challenges to the registration, since most of the state of the art works leverage the existence of metadata about the scene or prior knowledges.

Two different techniques are explored to perform the registration: a keypoint-based technique and an edge-based technique. The keypoint-based technique estimates the transformation by means of correspondences detected using Deep Learning, whilst the edge-based technique refines a transformation using a multimodal edge detection to establish anchor points to perform the estimation.

An extensive evaluation of the proposed methodologies is performed. Albeit further research is needed to achieve adequate performances, the obtained results show the potential of the usage of deep learning techniques to learn 2D and 3D similari- ties. The results also show the good performance of the proposed 2D-3D iterative refinement, up to the state of the art on 3D-3D registration.