PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes


  • Main Idea

    • Propose a novel Convolutional Neural Network (CNN), named PoseCNN, for end-to-end 6D object pose estimation.

    • A key idea behind PoseCNN is to decouple the pose estimation task into different components, which enables the network to explicitly model the dependencies and independencies between them.

    • Propose two new loss functions for rotation estimation, PoseLoss (PLoss) and ShapeMatch-Loss (SLoss), with SLoss designed for symmetric objects (a minimal sketch of both follows this list).

    • PoseCNN is able to handle occlusion and symmetric objects in cluttered scenes.
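
A minimal NumPy sketch of the two rotation losses, following the definitions in the paper: PLoss averages the squared distance between corresponding rotated model points, while SLoss matches each predicted point to the closest ground-truth point so that symmetric poses incur no penalty. The helper names, the quaternion (w, x, y, z) order, and the use of NumPy are assumptions for illustration; in PoseCNN the losses are implemented as differentiable layers over points sampled from the object's 3D model.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def ploss(q_pred, q_gt, points):
    """PLoss = 1/(2m) * sum over model points x of ||R(q_pred)x - R(q_gt)x||^2."""
    diff = points @ quat_to_rot(q_pred).T - points @ quat_to_rot(q_gt).T
    return 0.5 * np.mean(np.sum(diff**2, axis=1))

def sloss(q_pred, q_gt, points):
    """SLoss: like PLoss, but each predicted point is compared against the
    closest ground-truth point, so equivalent symmetric poses cost zero."""
    pred = points @ quat_to_rot(q_pred).T              # (m, 3)
    gt = points @ quat_to_rot(q_gt).T                  # (m, 3)
    d2 = np.sum((pred[:, None, :] - gt[None, :, :])**2, axis=2)  # (m, m)
    return 0.5 * np.mean(d2.min(axis=1))
```

For an object with, e.g., a 180-degree rotational symmetry, PLoss heavily penalizes the flipped (but visually identical) pose, while SLoss evaluates to near zero; that gap is the motivation for the second loss.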


  • Contribution

    1. Propose a novel convolutional neural network for 6D object pose estimation named PoseCNN. Our network achieves end-to-end 6D pose estimation and is very robust to occlusions between objects.

    2. Introduce ShapeMatch-Loss, a new training loss function for pose estimation of symmetric objects.

    3. Contribute a large-scale RGB-D video dataset (YCB-Video) for 6D object pose estimation, where we provide 6D pose annotations for 21 YCB objects.


  • Model

    1. Predict an object label for each pixel in the input image (semantic labeling).

    2. Estimate the 2D pixel coordinates of the object center by predicting a unit vector from each pixel towards the center. Using the semantic labels, image pixels associated with an object vote on the object center location in the image. In addition, the network estimates the distance of the object center. Assuming known camera intrinsics, the estimated 2D object center and its distance allow us to recover the 3D translation T (see the sketch after this list).

    3. Estimate the 3D Rotation R by regressing convolutional features extracted inside the bounding box of the object to a quaternion representation of R. The 2D center voting followed by rotation regression to estimate R and T can be applied to textured/texture-less objects and is robust to occlusions since the network is trained to vote on object centers even when they are occluded.
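
As a concrete illustration of the translation part of step 2, the sketch below recovers T by inverting the pinhole projection (c_x, c_y) = (f_x T_x / T_z + p_x, f_y T_y / T_z + p_y). The paper finds the center with a Hough voting layer; the least-squares ray intersection here is a simplified stand-in for that voting step, and all numbers (intrinsics, pixels, depth) are hypothetical.

```python
import numpy as np

def intersect_votes(pixels, directions):
    """Simplified stand-in for PoseCNN's Hough voting layer: find the 2D
    center c minimizing the squared distance to every voting ray p + t*n
    (pixel p, unit direction n towards the center) via least squares."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, n in zip(pixels, directions):
        M = np.eye(2) - np.outer(n, n)   # projects onto the ray's normal
        A += M
        b += M @ p
    return np.linalg.solve(A, b)

def recover_translation(center, Tz, K):
    """Invert the pinhole projection to get the 3D translation T from the
    voted 2D center and the predicted depth Tz, given intrinsics K."""
    fx, fy, px, py = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    cx, cy = center
    return np.array([(cx - px) * Tz / fx, (cy - py) * Tz / fy, Tz])

# Hypothetical numbers: two pixels voting towards a center near (320, 260).
K = np.array([[572.4,   0.0, 325.3],
              [  0.0, 573.6, 242.0],
              [  0.0,   0.0,   1.0]])
pixels = np.array([[300.0, 260.0], [320.0, 240.0]])
dirs = np.array([[1.0, 0.0], [0.0, 1.0]])   # unit vectors towards the center
center = intersect_votes(pixels, dirs)       # -> ~(320, 260)
print(recover_translation(center, 0.75, K))  # 3D translation T in meters
```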



  • Data and Metrics

    • Datasets

      • YCB-Video
      • Occluded-LINEMOD

      • Generate 80,000 synthetic images for training on both datasets by randomly placing objects in a scene.

    • Evaluation Metrics

      • Instance-Level Pose Estimation

        • ADD(-S): the average distance between model points transformed by the ground-truth and estimated poses; the symmetric variant ADD-S matches each point to its closest counterpart (see the sketch below).
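
A minimal NumPy sketch of the two metric variants, assuming `points` is an (m, 3) array of model points (function names are illustrative). ADD averages the distance between corresponding transformed points; ADD-S, used for symmetric objects, averages the distance to the closest transformed point. A pose is typically counted as correct when the metric falls below a distance threshold.

```python
import numpy as np

def add_metric(R_gt, t_gt, R_pred, t_pred, points):
    """ADD: mean distance between corresponding transformed model points."""
    gt = points @ R_gt.T + t_gt
    pred = points @ R_pred.T + t_pred
    return np.mean(np.linalg.norm(gt - pred, axis=1))

def add_s_metric(R_gt, t_gt, R_pred, t_pred, points):
    """ADD-S: mean distance from each ground-truth point to the closest
    predicted point, which is invariant to the object's symmetries."""
    gt = points @ R_gt.T + t_gt
    pred = points @ R_pred.T + t_pred
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)  # (m, m)
    return np.mean(d.min(axis=1))
```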

  • Result


1. Results on the YCB-Video Dataset
2. Results on the Occluded-LINEMOD Dataset

  • Limitation and Future Work

    • Limitation

      • The SLoss sometimes results in local minima in the pose space, similar to ICP.
    • Future work

      • Explore more efficient ways of handling symmetric objects in 6D pose estimation.