G2L-Net: Global to Local Network for Real-time 6D Pose Estimation with Embedding Vector Features

Arch

  • Main Idea

    • Propose a novel real-time 6D object pose estimation framework that runs at over 20 fps (~23 fps).

    • G2L-Net decouples 6D object pose estimation into three sub-tasks: global localization, translation localization, and rotation localization with embedding vector features.

    • The network better captures viewpoint information with the proposed point-wise embedding vector features (EVF).

    • G2L-Net achieves state-of-the-art performance in terms of both accuracy and speed.


  • Contribution

    1. Propose a novel real-time framework to estimate 6D object pose from RGB-D data in a global to local (G2L) way. Due to efficient feature extraction, the framework runs at over 20 fps on a GTX 1080 Ti GPU.

    2. Propose orientation-based point-wise embedding vector features (EVF) which better utilize viewpoint information than the conventional global point features.

    3. Propose a rotation residual estimator to estimate the residual between predicted rotation and ground truth, which further improves the accuracy of rotation prediction.
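    The residual idea in contribution 3 can be illustrated numerically. The sketch below is a toy example of the geometry only (function names are mine, not the paper's): a first head predicts an initial rotation R_init, and a residual estimator learns the rotation R_res that maps R_init onto the ground truth, so the refined estimate is R_res @ R_init.

    ```python
    import numpy as np

    def rotation_from_axis_angle(axis, angle):
        """Rodrigues' formula: rotation matrix from a unit axis and an angle in radians."""
        axis = np.asarray(axis, dtype=float)
        axis = axis / np.linalg.norm(axis)
        K = np.array([[0.0, -axis[2], axis[1]],
                      [axis[2], 0.0, -axis[0]],
                      [-axis[1], axis[0], 0.0]])
        return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

    # Toy poses (not real network outputs): a ground-truth rotation and a
    # slightly-off initial prediction about the same axis.
    R_gt = rotation_from_axis_angle([0, 0, 1], 0.50)
    R_init = rotation_from_axis_angle([0, 0, 1], 0.45)

    # The residual the estimator should learn, and the refined prediction.
    R_res = R_gt @ R_init.T
    R_refined = R_res @ R_init
    ```

    Composing the predicted residual with the initial rotation recovers the ground truth exactly in this toy case; in the network, the residual head only has to model a small correction, which is an easier regression target than the full rotation.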


  • Model

    1. First, extract the coarse object point cloud from the RGB-D image by 2D detection.

    2. Second, feed the coarse object point cloud to a translation localization network to perform 3D segmentation and object translation prediction.

    3. Third, via the predicted segmentation and translation, transform the fine object point cloud into a local canonical coordinate system, in which a rotation localization network is trained to estimate the initial object rotation.

    Note: In the third step, point-wise embedding vector features are defined to capture viewpoint-aware information. To obtain a more accurate rotation, a rotation residual estimator predicts the residual between the initial rotation and the ground truth, which further boosts pose estimation performance.
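    The three-stage flow above can be sketched with stand-in data. This is a shape-level illustration only (all names and the segmentation/translation stand-ins are mine; the paper's EVF is learned, not hand-computed): here the per-point features are unit direction vectors from each canonical point toward a set of object keypoints, which conveys the orientation-aware flavor of EVF.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Stage 1 stand-in: points lifted from the 2D-detected region of the RGB-D image.
    coarse_cloud = rng.normal(size=(500, 3))

    # Stage 2 stand-in: 3D segmentation mask and translation estimate.
    mask = np.linalg.norm(coarse_cloud, axis=1) < 1.5
    obj_points = coarse_cloud[mask]
    t_pred = obj_points.mean(axis=0)

    # Stage 3: transfer the fine point cloud into local canonical coordinates.
    local_points = obj_points - t_pred

    # EVF-flavored feature: unit vectors from each point toward object keypoints,
    # giving every point an orientation-aware descriptor of shape (N, K, 3).
    keypoints = rng.normal(size=(8, 3))
    diff = keypoints[None, :, :] - local_points[:, None, :]
    evf = diff / np.linalg.norm(diff, axis=-1, keepdims=True)
    ```

    Subtracting the translation first is what makes the rotation sub-task "local": the rotation network only sees a zero-centered cloud, so it does not have to disentangle translation from orientation.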

    (figure: model architecture)


  • Data and Metrics

    • Dataset

      • YCB-Video
      • LINEMOD
    • Evaluation Metrics

      • Instance-Level Pose Estimation

        • ADD (on LINEMOD)
        • ADD-S AUC (on YCB-Video)
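    The two metrics can be computed directly from the model point cloud and the two poses. A minimal numpy sketch of the standard definitions (ADD averages point-to-point distances under the predicted and ground-truth poses; ADD-S, used for symmetric objects, takes each point's distance to the closest transformed point; on YCB-Video the paper reports the area under the ADD-S accuracy-threshold curve):

    ```python
    import numpy as np

    def add_metric(R_pred, t_pred, R_gt, t_gt, pts):
        """ADD: mean distance between corresponding model points under the two poses."""
        p_pred = pts @ R_pred.T + t_pred
        p_gt = pts @ R_gt.T + t_gt
        return np.linalg.norm(p_pred - p_gt, axis=1).mean()

    def add_s_metric(R_pred, t_pred, R_gt, t_gt, pts):
        """ADD-S: mean distance from each predicted point to its *closest* ground-truth
        point, so symmetric objects are not penalized for equivalent orientations."""
        p_pred = pts @ R_pred.T + t_pred
        p_gt = pts @ R_gt.T + t_gt
        pairwise = np.linalg.norm(p_pred[:, None, :] - p_gt[None, :, :], axis=-1)
        return pairwise.min(axis=1).mean()
    ```

    On LINEMOD a pose is commonly counted as correct when ADD falls below 10% of the object diameter; note ADD-S is always less than or equal to ADD, since the closest point is at most as far as the corresponding one.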

  • Result

1. Results on the YCB-Video Dataset

(results table omitted)

2. Results on the LINEMOD Dataset

(results table omitted)


  • Limitation and Future Work

    • Limitation

      • G2L-Net relies on a robust 2D detector to detect the region of interest.

      • While the network exploits viewpoint information from the object point cloud, texture information is not well exploited.