TY - JOUR
T1 - 3D human pose estimation in multi-view operating room videos using differentiable camera projections
AU - Gerats, Beerend G.A.
AU - Wolterink, Jelmer M.
AU - Broeders, Ivo A.M.J.
N1 - Funding Information:
This work was sponsored by Johnson and Johnson Medical, Ltd. Jelmer M. Wolterink was supported by an NWO domain Applied and Engineering Sciences VENI grant (18192).
Publisher Copyright:
© 2022 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.
PY - 2023/7/4
Y1 - 2023/7/4
AB - 3D human pose estimation in multi-view operating room (OR) videos is a relevant asset for person tracking and action recognition. However, the surgical environment makes it challenging to find poses due to sterile clothing, frequent occlusions and limited public data. Methods specifically designed for the OR are generally based on the fusion of detected poses in multiple camera views. Typically, a 2D pose estimator such as a convolutional neural network (CNN) detects joint locations. Then, the detected joint locations are projected to 3D and fused over all camera views. However, accurate detection in 2D does not guarantee accurate localisation in 3D space. In this work, we propose to directly optimise for localisation in 3D by training 2D CNNs end-to-end based on a 3D loss that is backpropagated through each camera’s projection parameters. Using videos from the MVOR dataset, we show that this end-to-end approach outperforms optimisation in 2D space.
DO - 10.1080/21681163.2022.2155580
M3 - Article
SN - 2168-1163
VL - 11
SP - 1197
EP - 1205
JO - Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization
JF - Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization
IS - 4
ER -