This paper presents a strategy for estimating the geometry of an object of interest from a monocular video sequence acquired by a walking humanoid robot. The problem is solved using a space carving algorithm, which relies on both the accurate extraction of the object's occluding boundaries and the precise estimation of the camera pose for each video frame. For data acquisition, a monocular vision-based controller has been developed that drives the robot along a trajectory around an object placed on a small table. Because of the humanoid's stepping, the recorded sequence is contaminated with artefacts that hinder the correct extraction of contours across the video frames. To overcome this issue, a method that assigns a fitness score to each frame is proposed; it delivers a subset of camera poses and video frames that produces consistent 3D shape estimates of the objects used for experimental evaluation.
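To illustrate why space carving depends on both the silhouettes and the camera poses, the following is a minimal sketch of silhouette-based voxel carving. It is not the implementation used in this work. The function name `carve`, the voxel-centre representation, the 3x4 projection matrices and the boolean silhouette masks are all assumptions made for illustration, as is the choice to discard voxels that project outside the image.

```python
import numpy as np


def carve(voxels, projections, silhouettes):
    """Keep only the voxels whose projection falls inside every silhouette.

    voxels:      (N, 3) array of voxel centres in world coordinates.
    projections: list of 3x4 camera matrices, one per frame (assumed known).
    silhouettes: list of 2D boolean masks, True where the object is visible.
    """
    n = len(voxels)
    homog = np.hstack([voxels, np.ones((n, 1))])  # homogeneous coords, (N, 4)
    keep = np.ones(n, dtype=bool)

    for P, mask in zip(projections, silhouettes):
        uvw = homog @ P.T                      # projected points, (N, 3)
        depth = uvw[:, 2]
        in_front = depth > 1e-9                # ignore points behind the camera
        safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
        u = np.round(uvw[:, 0] / safe_depth).astype(int)  # pixel column
        v = np.round(uvw[:, 1] / safe_depth).astype(int)  # pixel row

        height, width = mask.shape
        inside = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)

        # Assumption: a voxel projecting outside the image is carved away.
        hit = np.zeros(n, dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]
        keep &= hit

    return voxels[keep]


# Toy example: one pinhole camera at the origin looking along +z.
P = np.array([[10.0, 0.0, 5.0, 0.0],
              [0.0, 10.0, 5.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
mask = np.zeros((11, 11), dtype=bool)
mask[4:7, 4:7] = True                          # small square silhouette
points = np.array([[0.0, 0.0, 10.0],           # projects into the silhouette
                   [3.0, 0.0, 10.0],           # projects outside the silhouette
                   [0.0, 0.0, -10.0]])         # behind the camera
result = carve(points, [P], [mask])
```

In this sketch a single inaccurate silhouette or pose is enough to remove a valid voxel, because a voxel survives only if every view agrees. This is the reason a per-frame selection step of the kind proposed here matters for the consistency of the carved shape.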