This thesis explores a multimodal approach to hate speech detection that combines vision and language (text). More specifically, we address the context of memes, a form of internet humour that presents additional challenges. We first gather meme data from several sources to build a hate memes dataset for this task. We then use this data to train and evaluate statistical models based on state-of-the-art neural networks. We study different ways to fine-tune pretrained descriptors for our specific task, and we propose a way to incorporate expert knowledge into the system, orienting it towards solving a real-world problem. We also address the issue of limited training data, experimenting with a self-supervised learning approach for pre-training. Finally, we compare the contribution of each modality to the overall performance of the model.