DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

Architecture

  • Main Idea

    • Propose an end-to-end deep learning approach for estimating 6-DoF poses of known objects from RGB-D inputs.

    • DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embeddings, from which the pose is estimated.

    • The core of our approach is to embed and fuse RGB values and point clouds at a per-pixel level.

    • Propose an iterative method which performs pose refinement within the end-to-end learning framework. This greatly enhances model performance while keeping inference speed real-time.

    • Robust in highly cluttered scenes thanks to the dense fusion method, while running at near real-time speed (~16 FPS).

    • Demonstrated in a real robot task, where the robot estimates the poses of objects and grasps them to clear a table.

    • The proposed refinement module can be trained jointly with the main architecture and only takes a small fraction of the total inference time.


  • Contribution

    1. Present a principled way to combine color and depth information from the RGB-D input. Augment the information of each 3D point with 2D information from an embedding space learned for the task and use this new color-depth space to estimate the 6D pose.

    2. Integrate an iterative refinement procedure within the neural network architecture that improves the pose estimation while achieving near real-time inference; this removes previous methods' dependency on a post-processing ICP step (a minimal sketch of the pose-composition idea follows below).
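
The following is a minimal sketch of the iterative-refinement idea referenced above; the function names, shapes, and the `predict_residual` stand-in for the refinement network are illustrative assumptions, not the authors' implementation. The key point is that each iteration re-expresses the observed points in the frame of the current estimate, predicts a small corrective transform, and composes it into the running pose.

```python
import numpy as np

def compose(R_outer, t_outer, R_inner, t_inner):
    """Compose rigid transforms: x -> R_outer @ (R_inner @ x + t_inner) + t_outer."""
    return R_outer @ R_inner, R_outer @ t_inner + t_outer

def refine_pose(obs_points, R, t, predict_residual, n_iters=2):
    """Iteratively refine a pose (R, t) that maps model frame -> camera frame.

    `predict_residual` stands in for the refinement network: given the observed
    points re-expressed in the frame of the current estimate, it predicts a small
    corrective rotation and translation.
    """
    for _ in range(n_iters):
        # With a perfect pose, the re-expressed points would align with the model.
        canonical = (obs_points - t) @ R          # R.T @ (p - t) for each row p
        dR, dt = predict_residual(canonical)
        R, t = compose(R, t, dR, dt)              # fold the correction into the estimate
    return R, t
```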


  • Model

    The architecture contains two main stages:

    1. The first stage takes the color image as input and performs semantic segmentation for each known object category. Then, for each segmented object, the masked depth pixels (converted to a 3D point cloud) as well as an image patch cropped by the bounding box of the mask are fed to the second stage (see the back-projection sketch after this list).

    2. The second stage processes the results of the segmentation and estimates the object’s 6D pose. It comprises 4 components:

      • 2.1) A fully convolutional network that processes the color information and maps each pixel in the image crop to a color feature embedding.

      • 2.2) A PointNet-based network that maps each point in the masked 3D point cloud to a geometric feature embedding.

      • 2.3) A pixel-wise fusion network that combines both embeddings and outputs the 6D pose estimate of the object based on an unsupervised confidence scoring (sketched below, after the architecture figure).

      • 2.4) An iterative self-refinement methodology to train the network in a curriculum learning manner and refine the estimation result iteratively.
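
As referenced in stage 1 above, here is a minimal sketch of back-projecting the masked depth pixels into a camera-frame point cloud; the function signature, `depth_scale`, and the pinhole intrinsics `fx, fy, cx, cy` are illustrative assumptions.

```python
import numpy as np

def depth_mask_to_pointcloud(depth, mask, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project masked depth pixels into camera-frame 3D points.

    depth: (H, W) depth image in raw sensor units (e.g. millimetres).
    mask:  (H, W) boolean segmentation mask for one object instance.
    fx, fy, cx, cy: pinhole camera intrinsics.
    """
    v, u = np.nonzero(mask)                # pixel coordinates inside the mask
    z = depth[v, u] / depth_scale          # metric depth
    valid = z > 0                          # drop pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)     # (N, 3) point cloud
```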

    (Figure: model architecture)
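
Below is a minimal PyTorch-style sketch of the fusion and prediction heads from components 2.1-2.3; the layer sizes, names, and the use of average pooling and a sigmoid are illustrative assumptions rather than the paper's exact configuration. Per point, the color embedding (sampled at that point's pixel) and the geometric embedding are concatenated, enriched with a pooled global feature, and fed to per-point rotation, translation, and confidence heads; training weights each point's pose loss by its confidence with a regularizer (roughly L = (1/N) Σ_i (L_i c_i − w log c_i)), and at inference the prediction of the most confident point is kept.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFusionHead(nn.Module):
    """Illustrative per-point fusion + pose/confidence heads (not the official code)."""

    def __init__(self, d_color=128, d_geom=128, d_global=256):
        super().__init__()
        d_point = d_color + d_geom                       # per-point fused feature size
        self.global_mlp = nn.Sequential(nn.Conv1d(d_point, d_global, 1), nn.ReLU())
        d_fused = d_point + d_global
        self.rot_head = nn.Conv1d(d_fused, 4, 1)         # per-point quaternion
        self.trans_head = nn.Conv1d(d_fused, 3, 1)       # per-point translation
        self.conf_head = nn.Conv1d(d_fused, 1, 1)        # per-point confidence

    def forward(self, color_feat, geom_feat):
        # color_feat, geom_feat: (B, C, N) features for the N masked points;
        # the color feature of each point is sampled at its corresponding pixel.
        point_feat = torch.cat([color_feat, geom_feat], dim=1)
        # Global context: pool over points, then broadcast back to every point.
        global_feat = self.global_mlp(point_feat).mean(dim=2, keepdim=True)
        fused = torch.cat(
            [point_feat, global_feat.expand(-1, -1, point_feat.shape[2])], dim=1)
        quat = F.normalize(self.rot_head(fused), dim=1)  # unit quaternions
        trans = self.trans_head(fused)
        conf = torch.sigmoid(self.conf_head(fused))
        return quat, trans, conf   # at inference, keep the most confident point's pose
```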


  • Data and Metrics

    • Dataset

      • YCB-Video
      • LINEMOD
    • Evaluation Metrics

      • Instance-Level Pose Estimation

        • ADD (on LINEMOD)
        • ADD-S AUC (on YCB-Video)
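
A short NumPy sketch of these two metrics (the helper names are illustrative): ADD averages the distance between corresponding model points under the ground-truth and predicted poses, while ADD-S (used for symmetric objects) averages the distance from each ground-truth-transformed point to its closest predicted-transformed point. On YCB-Video the reported number is the area under the ADD-S accuracy-vs-threshold curve.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) array of model points."""
    return points @ R.T + t

def add_metric(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(model_points, R_gt, t_gt)
    pred = transform(model_points, R_pred, t_pred)
    return np.linalg.norm(gt - pred, axis=1).mean()

def adds_metric(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean closest-point distance, insensitive to object symmetry."""
    gt = transform(model_points, R_gt, t_gt)
    pred = transform(model_points, R_pred, t_pred)
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return dists.min(axis=1).mean()
```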

  • Results

    1. Results on the YCB-Video Dataset


    2. Results on the LINEMOD Dataset



  • Limitation and Future Work

    • Limitation
      • This method does not explicitly model the correlation within and between the RGB and depth modalities, so it cannot fully exploit their consistent and complementary information to learn discriminative features for object pose estimation.