
Articulate3D Challenge

Interaction Understanding

Challenge contacts:

Anna-Maria Halacheva, Yang Miao,
INSAIT, Sofia University "St. Kliment Ohridski"

This challenge is based on the Articulate3D dataset, which will be presented at ICCV 2025. For the challenge, we provide access to the train and validation sets, along with a data loader to get you started. We will also publicly release the code behind the USDNet baseline from the Articulate3D paper, which serves as the baseline for this challenge.

Participants will be able to submit their predictions to the evaluation server (eval.ai) from August 1st to October 15th, 2025.

Task Description

Task: Given a 3D indoor scene, the objective is to identify all movable parts and predict their interaction specifications. These include the part's motion characteristics—such as axis, origin, and motion type (rotation or translation)—as well as the specific graspable region that enables interaction (e.g., a door knob or window handle).

Input: A 3D point cloud of the scene.

Output:
1. Segmentation of all movable (articulated) parts.
2. For each movable part:

  • (a) Predicted motion specification: axis, origin, and motion type.
  • (b) The mask of the associated interactable region (e.g., handle, knob, button, switch).
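
For concreteness, the sketch below shows one hypothetical way to represent a single predicted part with these outputs; the field names are illustrative only and are not the required submission format (see Submission Instructions below).

# Hypothetical per-part prediction record; field names are illustrative only,
# not the official submission format (see Submission Instructions).
from dataclasses import dataclass
import numpy as np

@dataclass
class MovablePartPrediction:
    part_mask: np.ndarray          # (num_vertices,) bool, segmentation of the movable part
    motion_type: str               # "rotation" or "translation"
    axis: np.ndarray               # (3,) unit vector, motion axis direction
    origin: np.ndarray             # (3,) point the axis passes through
    interactable_mask: np.ndarray  # (num_vertices,) bool, graspable region (handle, knob, ...)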

Challenge Phases

The challenge is divided into two main phases:

Development Phase: Participants are encouraged to use the training and validation splits of the Articulate3D dataset for experimentation and method development. All annotations for these splits are publicly available.

Test Phase: Participants will submit their predictions on the test set to the evaluation server. Ground truth annotations for the test split will remain private. (Server link coming soon.)


Working with Articulate3D

⚠️ Note: Articulate3D annotations are based on ScanNet++ scenes. You must obtain the ScanNet++ scans separately; Articulate3D provides only the annotations introduced in the Articulate3D paper.

Articulate3D Annotations: Articulate3D offers diverse per-scene annotations; the ones relevant to this challenge are:

  • (a) Segmentation masks for the parts of all articulated objects. The parts include fixed, movable and interactable (graspable) parts.
  • (b) Connectivity graphs of the parts (e.g., to which door a certain knob belongs, and to which cabinet the door belongs).
  • (c) Motion specifications for each movable part (origin, axis, motion range, and type).


Dataset Structure

Each scene contains two JSON files named using the following convention:

{scannetpp_scan_id}_parts.json  
{scannetpp_scan_id}_artic.json
  • {scannetpp_scan_id}: The ID of the ScanNet++ scene.
  • parts.json: Contains part segmentation annotations.
  • artic.json: Contains articulation (motion) annotations.
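
As a minimal sketch, the two annotation files of a scene could be loaded as follows; the directory layout is an assumption, and the provided data loader remains the authoritative reference.

# Minimal sketch for loading the two annotation files of one scene.
# The directory layout is an assumption; the field names inside the JSONs
# are documented below and in the provided data loader.
import json
from pathlib import Path

def load_scene_annotations(annotation_dir, scannetpp_scan_id):
    parts_path = Path(annotation_dir) / f"{scannetpp_scan_id}_parts.json"
    artic_path = Path(annotation_dir) / f"{scannetpp_scan_id}_artic.json"
    with open(parts_path) as f:
        parts = json.load(f)   # part segmentation annotations
    with open(artic_path) as f:
        artic = json.load(f)   # articulation (motion) annotations
    return parts, artic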

Part Segmentation Annotations

  • Face-based segmentation is provided via the triIndices field.
  • Vertex-based segmentation can be derived by majority voting over the labels of each vertex's incident faces (see the sketch after this list).
  • Hierarchy Representation: Encoded in the part label using a dot-separated string:
    {obj_id}.{parent_hierarchy}.{own_hierarchy}.{label}
    Example:
    3.1.cabinet  
    3.1.2_1.door  
    3.1.2_2.door  
    3.1.2_1.3_1.handle
    Explanation:
    • 3.1.cabinet: A cabinet object.
    • 3.1.2_1.door, 3.1.2_2.door: Two doors of the cabinet.
    • 3.1.2_1.3_1.handle: Handle on the first door.
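
The sketch below illustrates the two points above: parsing a dot-separated hierarchy label and deriving per-vertex labels by majority voting over incident faces. The (F, 3) face array and per-face label array are assumptions about how you unpack triIndices; adapt them to your mesh representation.

# Sketch: parse a hierarchy label and derive per-vertex labels from per-face labels.
# Assumes `faces` is an (F, 3) array of vertex indices and `face_labels` an (F,)
# array of integer part ids (e.g., built from each part's triIndices).
import numpy as np

def parse_part_label(label):
    """Split e.g. '3.1.2_1.door' into (object id, hierarchy tokens, semantic label)."""
    tokens = label.split(".")
    return tokens[0], tokens[1:-1], tokens[-1]

def vertex_labels_by_voting(faces, face_labels, num_vertices):
    """Assign each vertex the most frequent label among its incident faces."""
    votes = {}  # vertex index -> {part id: count}
    for face, label in zip(faces, face_labels):
        for v in face:
            counts = votes.setdefault(int(v), {})
            counts[int(label)] = counts.get(int(label), 0) + 1
    vertex_labels = np.full(num_vertices, -1, dtype=np.int64)  # -1 = unlabeled
    for v, counts in votes.items():
        vertex_labels[v] = max(counts, key=counts.get)
    return vertex_labels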

Articulation Annotations

  • Each movable part is indexed by its pid, which corresponds to a partId in parts.json.
  • The base field denotes the static reference part (e.g., door frame).
  • The base can also be inferred from the label hierarchy as the direct parent part.
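
A minimal sketch of linking articulation entries to their part annotations; only the pid, partId, and base field names come from the description above, and the surrounding JSON nesting is an assumption.

# Sketch: link articulation entries to part annotations via pid <-> partId.
# Only the pid/partId/base field names come from the documentation above;
# the surrounding JSON nesting is an assumption.
def index_parts_by_id(part_entries):
    """part_entries: iterable of part dicts, each carrying a 'partId' field."""
    return {p["partId"]: p for p in part_entries}

def resolve_articulations(artic_entries, parts_by_id):
    """Attach each movable part and its base (static reference) part to the articulation."""
    resolved = []
    for art in artic_entries:
        resolved.append({
            "articulation": art,
            "part": parts_by_id.get(art["pid"]),
            "base": parts_by_id.get(art.get("base")),  # may also be inferred from the label hierarchy
        })
    return resolved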


Data Loader: A Python-based scene iterator that returns:

  • (1) A dictionary of movable parts, each with its motion annotation and the list of its interactable parts.
  • (2) A face-level scene mask marking all movable and interactable segments.

📤 Submission Instructions

1. What to Submit

Participants must submit a Pickle file (.pkl) containing predictions for each scan. Each prediction must include all detected movable and interactable instances in that scan.

An example file will be provided for download here by July 15th.

The file should be structured as a dictionary as follows:

{
  "scene_id_1": {
    "pred_masks": numpy.ndarray,    # binary masks over mesh vertices, shape (num_vertices, num_pred_parts)
    "pred_scores": numpy.ndarray,   # confidence score per predicted part, shape (num_pred_parts,)
    "pred_classes": numpy.ndarray,  # class label per predicted part, shape (num_pred_parts,); 1: rotation, 2: translation
    "pred_origins": numpy.ndarray,  # axis origin per predicted part, shape (num_pred_parts, 3)
    "pred_axes": numpy.ndarray,     # axis direction per predicted part, shape (num_pred_parts, 3)
  },
  ...
  "scene_id_2": {
    ...
  }
}
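
As a sketch, a submission in this format could be assembled and saved as follows; the intermediate per-scene prediction structure and the output filename are assumptions.

# Sketch: assemble and save a submission file in the format above.
# The per-part input dictionaries and the output filename are assumptions.
import pickle
import numpy as np

def build_submission(per_scene_preds, out_path="submission.pkl"):
    """per_scene_preds: {scene_id: [per-part dicts with 'mask', 'score', 'cls', 'origin', 'axis']}."""
    submission = {}
    for scene_id, preds in per_scene_preds.items():
        submission[scene_id] = {
            "pred_masks": np.stack([p["mask"] for p in preds], axis=1).astype(bool),    # (num_vertices, num_pred_parts)
            "pred_scores": np.asarray([p["score"] for p in preds], dtype=np.float32),   # (num_pred_parts,)
            "pred_classes": np.asarray([p["cls"] for p in preds], dtype=np.int64),      # 1: rotation, 2: translation
            "pred_origins": np.stack([p["origin"] for p in preds]).astype(np.float32),  # (num_pred_parts, 3)
            "pred_axes": np.stack([p["axis"] for p in preds]).astype(np.float32),       # (num_pred_parts, 3)
        }
    with open(out_path, "wb") as f:
        pickle.dump(submission, f)
    return out_path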

2. Metrics Computed

The following evaluation metrics will be computed on your submission:

  • AP@50%: Average Precision at 50% IoU threshold (standard semantic instance segmentation)
  • Articulation-specific metrics:
    • MA: Match with correct Axis
    • MO: Match with correct Origin
    • MAO-ST: Match with both Axis and Origin
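
For orientation, the sketch below shows commonly used error measures behind such axis/origin matching: the angle between (undirected) axis directions and the distance from the predicted origin to the ground-truth axis line. The exact matching procedure and thresholds applied by the evaluation server are not specified here.

# Sketch of common axis/origin error measures for matched part instances.
# These formulas are a typical choice, not the official evaluation code;
# the server's matching rule and thresholds may differ.
import numpy as np

def axis_angle_error_deg(pred_axis, gt_axis):
    """Angle between axis directions in degrees, ignoring sign (axes are undirected)."""
    p = np.asarray(pred_axis) / np.linalg.norm(pred_axis)
    g = np.asarray(gt_axis) / np.linalg.norm(gt_axis)
    cos = np.clip(abs(np.dot(p, g)), 0.0, 1.0)
    return np.degrees(np.arccos(cos))

def origin_error(pred_origin, gt_origin, gt_axis):
    """Distance from the predicted origin to the ground-truth axis line
    (any point on the rotation axis is an equally valid origin)."""
    g = np.asarray(gt_axis) / np.linalg.norm(gt_axis)
    diff = np.asarray(pred_origin) - np.asarray(gt_origin)
    return float(np.linalg.norm(diff - np.dot(diff, g) * g))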

BibTeX


    @article{halacheva2024articulate3d,
      title={Holistic Understanding of 3D Scenes as Universal Scene Description},
      author={Anna-Maria Halacheva and Yang Miao and Jan-Nico Zaech and Xi Wang and Luc Van Gool and Danda Pani Paudel},
      year={2024},
      journal={arXiv preprint arXiv:2412.01398},
    }