We propose a technique for coherently co-clustering uncalibrated views of a scene using a contour-based representation. Our work extends a previous framework, an iterative algorithm for segmenting sequences with small variations, whose partition solution space is too restrictive when consecutive images present larger variations. To handle this more flexible scenario, we present three main contributions. First, motion information is used for both region adjacency and region similarity. Second, a two-step iterative architecture is proposed to enlarge the partition solution space. Third, a feasible global optimization that jointly processes all the views is implemented. In addition to these contributions, which are based on low-level features, we also explore introducing higher-level features, in the form of semantic information, into the co-clustering algorithm. We evaluate these techniques on multiview and temporal datasets, showing that they outperform state-of-the-art approaches.