G2L-Net: Global to Local Network for Real-time 6D Pose Estimation with Embedding Vector Features
-
Main Idea
-
Propose a novel real-time 6D object pose estimation framework that runs at over 20 fps (~23 fps).
-
G2L-Net decouples object pose estimation into three sub-tasks: global localization, translation localization, and rotation localization with embedding vector features.
-
The network can better capture viewpoint information with the proposed point-wise embedding vector features (EVF).
-
G2L-Net achieves state-of-the-art performance in terms of both accuracy and speed.
-
-
Contribution
-
Propose a novel real-time framework to estimate 6D object pose from RGB-D data in a global-to-local (G2L) way. Due to efficient feature extraction, the framework runs at over 20 fps on a GTX 1080 Ti GPU.
-
Propose orientation-based point-wise embedding vector features (EVF) which better utilize viewpoint information than the conventional global point features.
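A minimal sketch of how such orientation-based per-point vectors could be formed as regression targets, assuming EVF supervises unit directions from each object point toward predefined 3D keypoints (function and variable names are hypothetical, not the paper's code):

```python
import numpy as np

def evf_targets(points, keypoints, eps=1e-8):
    """Per-point unit vectors pointing from each point toward each
    predefined 3D keypoint; these serve as the regression targets.

    points:    (N, 3) object points in the local canonical frame
    keypoints: (K, 3) predefined 3D keypoints of the object model
    returns:   (N, K, 3) unit direction vectors
    """
    offsets = keypoints[None, :, :] - points[:, None, :]   # (N, K, 3)
    norms = np.linalg.norm(offsets, axis=-1, keepdims=True)
    # Normalizing keeps only the orientation, the viewpoint cue EVF exploits.
    return offsets / np.maximum(norms, eps)
```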
-
Propose a rotation residual estimator that predicts the residual between the predicted rotation and the ground truth, which further improves the accuracy of rotation prediction.
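A minimal sketch of how a predicted residual could refine the initial rotation; the rotation-matrix parameterization and left-composition order here are assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def apply_rotation_residual(R_init, R_res):
    """Refine the initial rotation with the predicted residual
    (left-composition; the order is an assumption)."""
    return R_res @ R_init

def residual_target(R_gt, R_init):
    """Ground-truth residual consistent with the composition above:
    R_gt = residual_target(R_gt, R_init) @ R_init."""
    return R_gt @ R_init.T
```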
-
-
Model
-
First, extract a coarse object point cloud from the RGB-D image via 2D detection.
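A minimal sketch of this step, assuming a pinhole camera model: depth pixels inside the detected box are back-projected with the camera intrinsics (names are hypothetical):

```python
import numpy as np

def bbox_to_point_cloud(depth, bbox, K):
    """Back-project the depth pixels inside a 2D detection into 3D.

    depth: (H, W) depth map in meters
    bbox:  (x1, y1, x2, y2) detected 2D bounding box (integer pixels)
    K:     (3, 3) camera intrinsic matrix
    returns (N, 3) coarse object point cloud in the camera frame.
    """
    x1, y1, x2, y2 = bbox
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    us, vs = np.meshgrid(np.arange(x1, x2), np.arange(y1, y2))
    z = depth[y1:y2, x1:x2]
    valid = z > 0                      # drop missing depth readings
    z = z[valid]
    x = (us[valid] - cx) * z / fx      # pinhole back-projection
    y = (vs[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```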
-
Second, feed the coarse object point cloud to a translation localization network to perform 3D segmentation and object translation prediction.
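A minimal sketch of how the two outputs of this step might be combined, assuming the network predicts per-point foreground scores plus a translation residual relative to the mean of the segmented points (an assumption about the exact head design):

```python
import numpy as np

def predict_translation(points, seg_logits, t_residual):
    """Combine 3D segmentation with a predicted translation residual.

    points:     (N, 3) coarse object point cloud
    seg_logits: (N,) per-point foreground scores from the network
    t_residual: (3,) residual predicted relative to the segment mean
    returns the fine (segmented) point cloud and the translation estimate.
    """
    mask = seg_logits > 0                  # foreground / background split
    fine_points = points[mask]
    # The mean of the segmented points is a coarse translation;
    # the learned residual corrects it.
    t_pred = fine_points.mean(axis=0) + t_residual
    return fine_points, t_pred
```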
-
Third, using the predicted segmentation and translation, transform the fine object point cloud into a local canonical coordinate system, in which a rotation localization network is trained to estimate the initial object rotation.
Note: In the third step, define point-wise embedding vector features to capture viewpoint-aware information. To obtain a more accurate rotation, adopt a rotation residual estimator that predicts the residual between the initial rotation and the ground truth, which boosts the initial pose estimate.
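A minimal sketch of the canonicalization, assuming the local frame is obtained by subtracting the predicted translation; the commented network calls are placeholders, not the paper's API:

```python
import numpy as np

def to_canonical(fine_points, t_pred):
    """Translate the segmented (fine) points into the local canonical
    frame so the rotation network is invariant to object translation."""
    return fine_points - t_pred[None, :]

# Placeholder flow for the third step (network calls are hypothetical):
#   pts_c  = to_canonical(fine_points, t_pred)
#   evf    = rotation_net.evf(pts_c)              # viewpoint-aware features
#   R_init = rotation_net.head(evf)               # initial rotation
#   R_fin  = residual_estimator(evf, R_init) @ R_init
```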
-
-
Data and Metrics
-
Dataset
- YCB-Video
- LINEMOD
-
Evaluation Metrics
-
Instance-Level Pose Estimation
- ADD (on LINEMOD)
- ADD-S AUC (on YCB-Video)
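A minimal sketch of the two distance metrics, assuming `model_pts` are the object model vertices; the ADD-S AUC reported on YCB-Video is the area under the accuracy-threshold curve of this distance, with thresholds up to 0.1 m:

```python
import numpy as np
from scipy.spatial import cKDTree

def add_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding model points
    transformed by the predicted and ground-truth poses."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def add_s_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: mean closest-point distance; handles symmetric objects
    where point correspondences are ambiguous."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    dists, _ = cKDTree(gt).query(pred, k=1)
    return dists.mean()
```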
-
-
-
Results
1. Results on the YCB-Video Dataset
2. Results on the LINEMOD Dataset
-
Limitation and Future work
-
Limitation
-
G2L-Net relies on a robust 2D detector to detect the region of interest.
-
While the network exploits viewpoint information from the object point cloud, texture information is not well utilized.
-
-
- pdf | code | Presentation