Notes on 6D Pose Estimation Papers

RGB based 6D Pose Estimation

  1. Regress 6D pose directly.

    • Sensitive to small errors due to large search space.
    • Non-linearity of the rotation space makes data-driven DNNs hard to train and generalize. [-> addressed by a post-refinement procedure, or by discretizing the rotation space to turn pose estimation into a classification problem followed by post-refinement.]
  2. Detect the 2D projections of 3D keypoints, then obtain the 6D pose by solving a Perspective-n-Point (PnP) problem. Suffers from:

    • Truncated objects as some of the key points may be outside the input image.

    • Most methods are built on top of 2D projections; errors that are small in projection can be large in real 3D space.

    • Different keypoints in 3D space may overlap after 2D projection, making them hard to distinguish.

    • Geometric constraint information of rigid objects is partially lost after projection.
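Two of the failure modes above (keypoints overlapping after projection, and small 2D errors becoming large 3D errors) can be seen with a toy pinhole camera. This is an illustrative sketch; the intrinsics are made-up values, not from any paper:

```python
import numpy as np

# Hypothetical pinhole intrinsics (fx = fy = 600, principal point 320, 240).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(X, K):
    """Project a 3D point in the camera frame to 2D pixel coordinates."""
    x = K @ X
    return x[:2] / x[2]

# Two distinct 3D keypoints lying on the same viewing ray...
p_near = np.array([0.1, 0.05, 1.0])
p_far = p_near * 3.0              # same direction, three times the depth
uv_near = project(p_near, K)
uv_far = project(p_far, K)
# ...land on exactly the same pixel, so 2D detection cannot tell them apart.
assert np.allclose(uv_near, uv_far)

# A 1-pixel detection error corresponds to a metric error that grows with
# depth: the back-projected ray spreads as it travels.
err_3d_near = 1.0 / K[0, 0] * p_near[2]   # ~1.7 mm at 1 m depth
err_3d_far = 1.0 / K[0, 0] * p_far[2]     # ~5 mm at 3 m depth
```

The last two lines quantify the second failure mode: the same pixel-level error is three times larger in 3D at three times the depth.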

PROBLEM

- May not be able to disambiguate the objects’ scales due to perspective projection.
- Vulnerable to heavy occlusion and poor illumination.
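The scale ambiguity can be demonstrated with a toy example: uniformly scaling a camera-frame point set scales the object's size and its distance together, leaving the projection pixel-for-pixel unchanged. The intrinsics below are illustrative:

```python
import numpy as np

# Hypothetical pinhole intrinsics.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(points, K):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates."""
    x = points @ K.T
    return x[:, :2] / x[:, 2:3]

# A toy "object": the corners of a 10 cm cube, 1 m in front of the camera.
cube = np.array([[sx * 0.05, sy * 0.05, sz * 0.05]
                 for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
obj = cube + np.array([0.0, 0.0, 1.0])

# Doubling both the object's size and its distance gives identical pixels,
# so an RGB-only method cannot recover the absolute scale.
obj_big_far = 2.0 * obj
assert np.allclose(project(obj, K), project(obj_big_far, K))
```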
  • DeepIM adopts CNNs to learn reliable representations for template-matching.

  • BB8 [2]

    • Applies CNNs in a multi-stage segmentation scheme to regress key-point coordinates.
    • Regresses the 2D projections of the object's 3D bounding box corners.
    • Problem: the main disadvantage of this pipeline is its multi-stage nature, resulting in very slow run times.
  • PVNet [2]

    • Proposes a deep offset-prediction model to alleviate the negative impact of occlusions.
    • Uses per-pixel voting for 2D keypoints to combine the advantages of dense methods and keypoint-based methods.
    • Designs a network that, for every pixel in the image, regresses an offset to predefined keypoints; pixels then vote for the keypoints located on the object itself.
    • Handles occlusion very well.
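A minimal sketch of the per-pixel voting idea: every visible object pixel predicts an offset to a keypoint, and the votes are aggregated robustly. Here the network's predictions are simulated by adding noise to the ground-truth offsets, and the aggregation is a simple median rather than PVNet's RANSAC-style voting; all values are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a true 2D keypoint and 200 visible object pixels.
keypoint = np.array([150.0, 120.0])
pixels = rng.uniform(0, 300, size=(200, 2))

# A trained network would predict, per pixel, an offset toward the keypoint;
# we simulate its noisy output here.
predicted_offsets = (keypoint - pixels) + rng.normal(0.0, 2.0, size=pixels.shape)

# Each pixel casts a vote for the keypoint location. Aggregating many votes
# is what makes the scheme occlusion-robust: even if the keypoint itself is
# hidden, visible pixels elsewhere on the object still vote for it.
votes = pixels + predicted_offsets
estimate = np.median(votes, axis=0)
assert np.linalg.norm(estimate - keypoint) < 1.0
```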
  • CDPN and Pix2Pose map 3D coordinates to 2D pixels and regress pose parameters on 2D images.

  • LatentFusion handles unseen object poses by reconstructing a latent 3D representation.

  • PoseCNN [1]

    • Estimates object masks, then separately estimates the translation of the object’s centroid and regresses a quaternion for rotation.
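The separately estimated pieces (a translation vector and a rotation quaternion) can be assembled into a 4x4 pose matrix as sketched below. The quaternion-to-matrix conversion is the standard formula; the predicted values are placeholders, not network output:

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    # Normalize first: a regressed quaternion is not exactly unit-length.
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Assemble a homogeneous pose from the two decoupled estimates
# (illustrative values):
q_pred = np.array([0.92, 0.0, 0.38, 0.0])   # rotation as a quaternion
t_pred = np.array([0.1, -0.05, 0.8])        # centroid translation in metres
T = np.eye(4)
T[:3, :3] = quat_to_rotmat(q_pred)
T[:3, 3] = t_pred
```

Decoupling the two outputs lets the translation head work from the 2D centroid and depth cues while the rotation head deals only with the (harder) non-linear rotation space.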

RGB-D based 6D Pose Estimation

Depth information disambiguates the object's scale, the most critical ambiguity in RGB images caused by perspective projection.

With geometry information, depth maps contribute to pose estimation for various lighting conditions and low-textured appearances, complementary to RGB images.
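A sketch of why depth resolves the scale ambiguity: with known intrinsics (illustrative values below), a pixel plus its measured depth back-projects to a unique 3D point, whereas an RGB pixel alone only constrains a viewing ray:

```python
import numpy as np

# Hypothetical pinhole intrinsics.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

def backproject(u, v, depth):
    """Lift pixel (u, v) with a measured depth (metres) to a 3D camera-frame
    point. The depth value fixes where along the viewing ray the point lies."""
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.array([x, y, z])

# The same pixel at two different measured depths yields two different
# 3D points on the same ray: the ambiguity RGB alone cannot resolve.
p1 = backproject(400.0, 300.0, 1.0)
p2 = backproject(400.0, 300.0, 2.0)
assert np.allclose(p2, 2.0 * p1)
```

This back-projection is also how methods that "represent geometry clues in 3D point clouds rather than depth maps" convert a depth image into a point cloud.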

Current RGB-D based approaches utilize depth information mainly in three ways:

1. RGB and depth information are used at separate stages: a coarse 6D pose is predicted from an RGB image, then refined with the ICP algorithm using depth information. [ICP is computationally expensive and sensitive to initialization.]

2. RGB and depth modalities are fused at an early stage, where the depth map is treated as another channel and concatenated with the RGB channels. [Fails to exploit the correlation between the two modalities; the refinement stage is computationally expensive and cannot achieve real-time inference speed.]

3. Fuse RGB and depth modalities at a late stage. It can achieve state-of-the-art performance while reaching almost real-time inference speed.
  • MCN employs two CNNs for representation learning on RGB and depth respectively; the resulting features are then concatenated for pose prediction.

  • PoseCNN and SSD-6D [1]

    • follow the coarse-to-fine scheme, where poses are initially estimated on RGB frames and subsequently refined on depth maps.
    • For SSD-6D: formulates pose estimation as a discrete pose classification problem.
  • MoreFusion builds a multi-view model to jointly reconstruct whole scenes and optimize multi-object poses.

    Methods: represent geometry clues as 3D point clouds rather than depth maps for higher efficiency.

  • DenseFusion [3]

    • Designs a heterogeneous network to integrate texture and shape features; such representations prove more discriminative than single-modal ones.
    • [Proposes an RGB-D based deep neural network that considers visual appearance and geometry structure simultaneously.]
    • Proposed to regress rotation and translation of objects directly with DNNs. [Problem: usually poor generalization due to the non-linearity of the rotation space.]
  • CF [~3] introduces attention modules to combine the two modalities for further improvement [exploits the consistent and complementary information between the two modalities by modeling intra- and inter-modality correlations with a self-attention mechanism].

  • G2L segments point clouds of objects in scenes with a frustum PointNet and regresses pose parameters via extra coordinate constraints.

  • PVN3D incorporates DenseFusion into 3D key-point detection and instance semantic segmentation, significantly boosting the performance.

    Problem: point clouds generated from depth maps are often of low quality, since the shape information is frequently incomplete and noisy.

  • PR-GCN develops the PRN and MMF-GCN modules to polish depth clues by generating refined point clouds and to enhance integration by capturing local geometry-aware inter-modality correlations, respectively; both are beneficial to pose estimation.
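The late-fusion step shared by several of these methods can be sketched with arrays standing in for learned features: per-point appearance and geometry embeddings are concatenated, then enriched with a pooled global feature. The dimensions and the pooling choice below are illustrative, not any paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for learned per-point features from two branches
# (dimensions are made-up):
n_points = 500
rgb_feat = rng.normal(size=(n_points, 32))   # appearance embedding per point
geo_feat = rng.normal(size=(n_points, 64))   # geometry embedding per point

# Late fusion: concatenate the two modalities per point, then append a
# global feature (pooled over all points) so that each point's descriptor
# also carries object-level context.
fused = np.concatenate([rgb_feat, geo_feat], axis=1)          # (500, 96)
global_feat = fused.mean(axis=0, keepdims=True)               # (1, 96)
per_point = np.concatenate(
    [fused, np.repeat(global_feat, n_points, axis=0)], axis=1)  # (500, 192)
assert per_point.shape == (n_points, (32 + 64) * 2)
```

Fusing per point (rather than concatenating whole-image features) is what lets these methods keep the pixel-to-point correspondence between the two modalities.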

Deep Learning 6D Pose Refiners

  • Designed to output the relative transformation between the real input image patch and a patch containing the object rendered at the predicted pose.

  • Some refinement algorithms rely on external object detection and pose estimation algorithms: DeepIM relies on PoseCNN (trained on real data), while [Deep model-based 6D pose refinement in RGB] relies on SSD-6D (focuses on training on synthetic images).
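For contrast with these learned refiners, the classical ICP refinement mentioned in the RGB-D section can be sketched as below: a toy point-to-point version with brute-force matching. Real pipelines add kd-trees and outlier rejection, which is part of why ICP is expensive; the demo values are synthetic:

```python
import numpy as np

def kabsch(P, Q):
    """Best-fit rotation R and translation t mapping points P onto Q."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Q.mean(0) - R @ P.mean(0)
    return R, t

def icp(src, dst, iters=30):
    """Toy point-to-point ICP: alternate nearest-neighbour matching
    and a closed-form Kabsch alignment."""
    P = src.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        # Brute-force nearest neighbours: the computationally heavy step.
        d2 = ((P[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(1)]
        R, t = kabsch(P, matched)
        P = P @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# Demo: recover a small known motion (3 deg about z plus a small shift).
rng = np.random.default_rng(0)
src = rng.uniform(size=(100, 3))
c, s = np.cos(np.deg2rad(3.0)), np.sin(np.deg2rad(3.0))
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.02, -0.01, 0.015])
R_est, t_est = icp(src, dst)
```

The demo uses a small initial offset on purpose: with a poor initial pose the nearest-neighbour matches are mostly wrong and ICP drifts to a bad local minimum, which is exactly the initialization sensitivity noted earlier and the gap these learned refiners aim to close.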