LineMOD

  • 15 objects, with around 8 objects visible per image.
  • There are ~1,200 frames for each object.
  • 15% of the frames are used for training and 85% for testing.
  • This split yields around 200 training frames and 1,000 testing frames per object (see the sketch below).
  • The train and test data are selected from the same video sequences in LM.

(This means that the illumination conditions and object appearances are similar in the train and test data.)
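
A minimal sketch of how such a per-object split can be constructed (the frame count, split fraction, and function are illustrative assumptions; the commonly used LineMOD splits are distributed as fixed index files):

    import random

    N_FRAMES = 1200        # approximate number of frames per object sequence
    TRAIN_FRACTION = 0.15  # 15% train / 85% test

    def make_split(n_frames=N_FRAMES, train_fraction=TRAIN_FRACTION, seed=0):
        """Randomly split one object's frame indices into train and test sets."""
        rng = random.Random(seed)
        indices = list(range(n_frames))
        rng.shuffle(indices)
        n_train = int(n_frames * train_fraction)
        return sorted(indices[:n_train]), sorted(indices[n_train:])

    train_ids, test_ids = make_split()
    print(len(train_ids), len(test_ids))  # 180 1020 (roughly 200 / 1,000 frames)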


YCB-V

  • It contains 92 video sequences with a total of 133,827 frames.
  • An average of 5 objects is visible per image.
  • The official train/test split uses 80 video sequences for training.
    • 80,000 frames of synthetic data are also provided by the YCB-V dataset as an extension to the training set.
  • Testing is performed on 2,949 keyframes chosen from the remaining 12 sequences (see the sketch below).
  • The train and test data of YCB-V are selected from different video sequences.

(The illumination conditions and object appearance therefore differ substantially between train and test, so this dataset is potentially more challenging, especially for color-based methods.)
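
A sketch of the sequence-level split, assuming sequences are numbered 0000-0091; the keyframe index file name is an assumption for illustration (the dataset ships its own split lists):

    # YCB-V: 92 sequences, the first 80 for training, the last 12 for testing.
    all_seqs = [f"{i:04d}" for i in range(92)]
    train_seqs = all_seqs[:80]  # 0000-0079
    test_seqs = all_seqs[80:]   # 0080-0091

    # Testing uses 2,949 keyframes sampled from the 12 test sequences,
    # typically listed in a plain-text index (one frame per line).
    def load_keyframes(path="image_sets/keyframe.txt"):  # path is an assumption
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]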


T-LESS

  • 50K images of 30 different consumer electrical components.
  • The training set (~39K images) consists of individual objects on a black background.
  • The test set (~10K images) consists of 20 different static scenes with varying levels of clutter, backgrounds, and numbers of objects.
  • 1,296 training images per object from each sensor.
  • 504 test images per scene from each of the 3 sensors (see the check below).
  • T-LESS provides training images and 3D models for the 30 objects, and test images of the 20 scenes annotated with ground-truth 6D object poses.
  • The images were captured from a systematically sampled view sphere around an object/scene.
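
The per-sensor counts are consistent with the stated totals; a quick check in Python:

    # T-LESS image counts per sensor (the dataset was captured with 3 sensors).
    n_objects, train_per_object = 30, 1296
    n_scenes, test_per_scene = 20, 504

    print(n_objects * train_per_object)  # 38880, i.e. ~39K training images
    print(n_scenes * test_per_scene)     # 10080, i.e. ~10K test images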

APC

  • It contains a set of 24 different objects, some non-rigid, some transparent, arranged on a set of shelves/bins.
  • 3 different views of each scene are provided:
    • one directly front-on to the shelf,
    • one slightly to the left,
    • one the same distance to the right.
  • 10K test RGB-D images with various amounts of occlusion.

NOCS

  • Context-Aware MixEd ReAlity (CAMERA): 300K composited images in total.
    • 275K for training.
    • 25K for validation.

  • REAL: 8K RGB-D frames.

  • 18 different real scenes (7 for training, 5 for validation, and 6 for testing) captured using a Structure Sensor.
    • 4,300 frames for training.
    • 950 frames for validation.
    • 2,750 frames for testing (see the check below).
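
The split sizes listed above are internally consistent; a quick check:

    # NOCS split sizes as listed above.
    camera = {"train": 275_000, "val": 25_000}
    real = {"train": 4_300, "val": 950, "test": 2_750}

    assert sum(camera.values()) == 300_000  # CAMERA: 300K composited images
    assert sum(real.values()) == 8_000      # REAL: 8K RGB-D frames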

HOPE

  • The dataset contains 28 objects in 50 cluttered scenes, with roughly 5 lighting variations per scene (2,038 images in total).
  • Each scene contains between 5 and 20 objects.
  • Images were captured from distances of 0.5 to 1.0 m, which are typical of robotic grasping.

  • The dataset features:

  • Train: 10 scenes, 2,038 RGB-D images; viewpoints per scene: 365, 331, 273, 145, 151, 169, 171, 181, 131, 121 (see the check below).
  • Val: 10 scenes, 50 images; 5 viewpoints per scene.
  • Test: 40 scenes, 188 images; varying numbers of viewpoints per scene.
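
The per-scene viewpoint counts sum to the stated training total; a quick check:

    # HOPE: viewpoints per scene for the 10 training scenes.
    train_views = [365, 331, 273, 145, 151, 169, 171, 181, 131, 121]
    assert sum(train_views) == 2038  # 2,038 training RGB-D images
    assert 10 * 5 == 50              # val: 10 scenes x 5 viewpoints each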

DoPose

  • The data contains 301 scenes and 3,325 view images.
  • Most of the scenes contain mixed objects.
  • The dataset contains 19 objects in total.
  • The dataset contains 2 different types of scenes (table and bin); each scene is captured from multiple view angles.
  • For bin scenes:
    • The data contains 183 scenes with 2,150 image views:
    • 35 scenes contain 2 views,
    • 20 scenes contain 3 views,
    • 128 scenes contain 16 views.
  • For table scenes:
    • The data contains 118 scenes with 1,175 image views:
    • 20 scenes contain 3 views,
    • 50 scenes contain 6 views,
    • 48 scenes contain 17 views.

HomebrewedDB

  • 33 objects (17 toy, 8 household and 8 industry-relevant objects)

  • Validation and test sequences:

  • The dataset features 13 scenes of varying complexity.
  • Each scene was captured with two sensors: PrimeSense Carmine and Kinect 2.
  • For each scene there are:
    • 340 validation RGB-D frames captured on a rotating turntable,
    • 1,000 test RGB-D frames captured in handheld mode.
  • The pose labels are provided for each of the objects in each frame.

FAT

  • The two types of scenes:
    • single (a single falling object),
    • mixed (2 to 10 falling objects).

  • Single

    • For single, each of the 21 object types has its own folder:
    • Within each folder there are 3 different scenes (kitchen, kitedemo, and temple) and 5 independent locations (0 through 4) within each scene:
    • Each of these subfolders contains a dataset of 100 images of a particular object within a particular scene location, giving 21 × 3 × 5 × 100 = 31,500 image frames for single.
  • Mixed

    • For mixed, the images are organized by the 3 scenes and 5 locations within each scene, as above.
    • Each of these subfolders contains a dataset of 2,000 images of objects within a particular scene location, giving 3 × 5 × 2,000 = 30,000 image frames for mixed.
    • Subfolder names (enumerated in the sketch below): kitchen_0, kitchen_1, kitchen_2, kitchen_3, kitchen_4, kitedemo_0, kitedemo_1, kitedemo_2, kitedemo_3, kitedemo_4, temple_0, temple_1, temple_2, temple_3, temple_4.
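
A short sketch that enumerates the scene-location subfolders and verifies the expected frame counts (folder names follow the listing above; the layout is otherwise an assumption for illustration):

    from itertools import product

    SCENES = ["kitchen", "kitedemo", "temple"]
    LOCATIONS = range(5)  # locations 0 through 4

    # The 15 scene-location subfolders, e.g. kitchen_0 ... temple_4.
    subfolders = [f"{scene}_{loc}" for scene, loc in product(SCENES, LOCATIONS)]

    n_objects = 21
    assert n_objects * len(subfolders) * 100 == 31_500  # single
    assert len(subfolders) * 2_000 == 30_000            # mixed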

MetaGraspNet

  • The dataset contains 100,000 RGB-D images, 11,000 scenes, and 25 classes of objects.
  • It is split into 5 difficulty levels to evaluate object detection and segmentation model performance in different grasping scenarios.
  • 5 scenarios with ~11K scenes (see the check below):

    • Scenario 1: 841 scenes, 1-9 viewpoints per scene, 5,236 images
    • Scenario 2: 166 scenes, 1-9 viewpoints per scene, 403 images
    • Scenario 3: 409 scenes, 1-9 viewpoints per scene, 2,070 images
    • Scenario 4: 6,355 scenes, 1-9 viewpoints per scene, 33,698 images
    • Scenario 5: 8,499 scenes, 1-9 viewpoints per scene, 57,710 images
  • NOTE: some scenes are common between different scenarios.
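
Summing the per-scenario rows shows why that note matters; the scene total only reconciles with the dataset-level figure once shared scenes are accounted for:

    # MetaGraspNet (scenes, images) per scenario, as listed above.
    scenarios = [(841, 5_236), (166, 403), (409, 2_070),
                 (6_355, 33_698), (8_499, 57_710)]

    print(sum(s for s, _ in scenarios))  # 16270: > 11K because scenes are shared
    print(sum(i for _, i in scenarios))  # 99117: ~100K images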


StereOBJ-1M

  • The dataset contains over 396K frames.
  • Over 1.5M annotations of 18 objects recorded in 183 scenes constructed in 11 different environments.
  • The 18 objects include 8 symmetric objects, 7 transparent objects, and 8 reflective objects.
  • The test set contains 32 image sequences that are selected to cover most environments and ensure every object is tested in at least 4,000 images across at least 3 different scenes.
  • A stereo video was recorded in every constructed scene.
  • The length of each video ranges from 2 to 7 minutes; sampled at 15 frames/sec, the recorded videos yield 396,509 stereo frames in total (see the check below).
  • On average, there are more than 2,100 stereo frames per scene.
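
The frame counts are consistent with the recording setup; a quick check:

    FPS = 15
    # A 2-7 minute video at 15 fps yields 1,800-6,300 frames.
    frames_per_video = (2 * 60 * FPS, 7 * 60 * FPS)
    # 396,509 frames over 183 scenes -> ~2,167 frames per scene on average.
    avg_frames_per_scene = 396_509 / 183
    print(frames_per_video, round(avg_frames_per_scene))  # (1800, 6300) 2167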