G2L-Net: Global to Local Network for Real-time 6D Pose Estimation with Embedding Vector Features

Arch

  • Main Idea

    • Propose a novel real-time 6D object pose estimation framework that runs at over 20 fps (~23 fps).

    • G2L-Net decouples 6D object pose estimation into three sub-tasks: global localization, translation localization, and rotation localization with embedding vector features.

    • The network better captures viewpoint information with the proposed point-wise embedding vector features (EVF).

    • G2L-Net achieves state-of-the-art performance in terms of both accuracy and speed.


  • Contribution

    1. Propose a novel real-time framework to estimate 6D object pose from RGB-D data in a global to local (G2L) way. Due to efficient feature extraction, the framework runs at over 20 fps on a GTX 1080 Ti GPU.

    2. Propose orientation-based point-wise embedding vector features (EVF) which better utilize viewpoint information than the conventional global point features.

    3. Propose a rotation residual estimator to estimate the residual between predicted rotation and ground truth, which further improves the accuracy of rotation prediction.
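    The residual idea in contribution 3 can be illustrated numerically. The sketch below is a toy example of the geometry only (function names are mine, not the paper's): a first head predicts an initial rotation R_init, and a residual estimator learns the rotation R_res that maps R_init onto the ground truth, so the refined estimate is R_res @ R_init.

    ```python
    import numpy as np

    def rotation_from_axis_angle(axis, angle):
        """Rodrigues' formula: rotation matrix from a unit axis and an angle in radians."""
        axis = np.asarray(axis, dtype=float)
        axis = axis / np.linalg.norm(axis)
        K = np.array([[0.0, -axis[2], axis[1]],
                      [axis[2], 0.0, -axis[0]],
                      [-axis[1], axis[0], 0.0]])
        return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

    # Toy poses (not real network outputs): a ground-truth rotation and a
    # slightly-off initial prediction about the same axis.
    R_gt = rotation_from_axis_angle([0, 0, 1], 0.50)
    R_init = rotation_from_axis_angle([0, 0, 1], 0.45)

    # The residual the estimator should learn, and the refined prediction.
    R_res = R_gt @ R_init.T
    R_refined = R_res @ R_init
    ```

    Composing the predicted residual with the initial rotation recovers the ground truth exactly in this toy case; in the network, the residual head only has to model a small correction, which is an easier regression target than the full rotation.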


  • Model

    1. First, extract the coarse object point cloud from the RGB-D image by 2D detection.

    2. Second, feed the coarse object point cloud to a translation localization network to perform 3D segmentation and object translation prediction.

    3. Third, via the predicted segmentation and translation, transform the fine object point cloud into a local canonical coordinate system, in which a rotation localization network is trained to estimate the initial object rotation.

    Note: In the third step, point-wise embedding vector features are defined to capture viewpoint-aware information. To obtain a more accurate rotation, a rotation residual estimator predicts the residual between the initial rotation and the ground truth, which further boosts pose estimation performance.
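    The three-stage flow above can be sketched with stand-in data. This is a shape-level illustration only (all names and the segmentation/translation stand-ins are mine; the paper's EVF is learned, not hand-computed): here the per-point features are unit direction vectors from each canonical point toward a set of object keypoints, which conveys the orientation-aware flavor of EVF.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Stage 1 stand-in: points lifted from the 2D-detected region of the RGB-D image.
    coarse_cloud = rng.normal(size=(500, 3))

    # Stage 2 stand-in: 3D segmentation mask and translation estimate.
    mask = np.linalg.norm(coarse_cloud, axis=1) < 1.5
    obj_points = coarse_cloud[mask]
    t_pred = obj_points.mean(axis=0)

    # Stage 3: transfer the fine point cloud into local canonical coordinates.
    local_points = obj_points - t_pred

    # EVF-flavored feature: unit vectors from each point toward object keypoints,
    # giving every point an orientation-aware descriptor of shape (N, K, 3).
    keypoints = rng.normal(size=(8, 3))
    diff = keypoints[None, :, :] - local_points[:, None, :]
    evf = diff / np.linalg.norm(diff, axis=-1, keepdims=True)
    ```

    Subtracting the translation first is what makes the rotation sub-task "local": the rotation network only sees a zero-centered cloud, so it does not have to disentangle translation from orientation.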

    (figure: model architecture)


  • Data and Metrics

    • Dataset

      • YCB-Video
      • LINEMOD
    • Evaluation Metrics

      • Instance-Level Pose Estimation

        • ADD (on LINEMOD)
        • ADD-S AUC (on YCB-Video)
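    The two metrics can be computed directly from the model point cloud and the two poses. A minimal numpy sketch of the standard definitions (ADD averages point-to-point distances under the predicted and ground-truth poses; ADD-S, used for symmetric objects, takes each point's distance to the closest transformed point; on YCB-Video the paper reports the area under the ADD-S accuracy-threshold curve):

    ```python
    import numpy as np

    def add_metric(R_pred, t_pred, R_gt, t_gt, pts):
        """ADD: mean distance between corresponding model points under the two poses."""
        p_pred = pts @ R_pred.T + t_pred
        p_gt = pts @ R_gt.T + t_gt
        return np.linalg.norm(p_pred - p_gt, axis=1).mean()

    def add_s_metric(R_pred, t_pred, R_gt, t_gt, pts):
        """ADD-S: mean distance from each predicted point to its *closest* ground-truth
        point, so symmetric objects are not penalized for equivalent orientations."""
        p_pred = pts @ R_pred.T + t_pred
        p_gt = pts @ R_gt.T + t_gt
        pairwise = np.linalg.norm(p_pred[:, None, :] - p_gt[None, :, :], axis=-1)
        return pairwise.min(axis=1).mean()
    ```

    On LINEMOD a pose is commonly counted as correct when ADD falls below 10% of the object diameter; note ADD-S is always less than or equal to ADD, since the closest point is at most as far as the corresponding one.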

  • Result

1. Results on the YCB-Video Dataset

(results table omitted)

2. Results on the LINEMOD Dataset

(results table omitted)


  • Limitation and Future Work

    • Limitation

      • G2L-Net relies on a robust 2D detector to detect the region of interest.

      • While the network exploits viewpoint information from the object point cloud, texture information is not well exploited.