Robust Camera Pose Estimation via Consensus on Ray Bundle and Vector Field


Haoang Li, Ji Zhao, Jean-Charles Bazin, Lei Luo, and Jian Yao*

School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, Hubei, P.R. China

*E-mail: jian.yao@whu.edu.cn

*Web: http://www.scholat.com/jianyao, http://cvrs.whu.edu.cn


1. Abstract

Estimating the camera pose requires point correspondences. However, in practice, correspondences are inevitably corrupted by outliers, which degrade the pose estimation. We propose a general and accurate outlier removal strategy for robust camera pose estimation. The proposed strategy detects outliers by leveraging the fact that only inliers comply with two effective consensuses: the 3D ray bundle consensus and the 2D vector field consensus. Our strategy has a nested structure. First, the outer module utilizes the 3D ray bundle consensus: we define a likelihood based on a probabilistic mixture model and maximize it with the expectation-maximization (EM) algorithm, alternately determining the inlier probability of each correspondence and the camera pose. Second, the inner module exploits the 2D vector field consensus to refine the probabilities obtained by the outer module; this Bayesian refinement facilitates the convergence of the outer module and improves the accuracy of the entire framework. Our strategy can be integrated into various existing camera pose estimation methods that are originally vulnerable to outliers. Experiments on both synthesized data and real images show that our approach outperforms state-of-the-art outlier rejection methods in terms of accuracy and robustness.

2. Approach

The proposed outlier removal strategy for camera pose estimation has a nested structure composed of an outer and an inner module.


(1) The outer module is based on the 3D ray bundle consensus (RBC), illustrated in Fig. 1(a). The bundle of rays passing through the inliers of the 3D-to-2D point correspondences intersects at a common point, i.e., the optical center of the camera, whereas the rays formed by outliers have arbitrary directions;

(2) The inner module is based on the 2D vector field consensus (VFC). We define a virtual camera and project the 3D points onto its image, so that the original 3D-to-2D correspondences are mapped to 2D-to-2D correspondences. These 2D-to-2D correspondences form a set of 2D vectors, as shown in Fig. 1(b). The inlier vectors share a regular orientation trend, while the outlier vectors are disordered.
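To make the two consensuses concrete, the following minimal Python/NumPy sketch (our own illustration, not the authors' MATLAB implementation; all function and variable names are hypothetical) scores a candidate pose (R, t) under the RBC by the angle between each observed bearing and the ray from the optical center to the corresponding 3D point, and forms the VFC vectors by projecting the 3D points through a virtual camera:

    import numpy as np

    def rbc_residuals(X, x, K, R, t):
        """Angular RBC residual of each 3D-to-2D correspondence.

        X: (N, 3) 3D points; x: (N, 2) observed pixels; K: (3, 3) intrinsics;
        R: (3, 3) rotation; t: (3,) translation. For an inlier, the observed
        bearing aligns with the ray from the optical center to the 3D point.
        """
        x_h = np.hstack([x, np.ones((len(x), 1))])         # homogeneous pixels
        b_obs = (np.linalg.inv(K) @ x_h.T).T               # observed bearings
        b_obs /= np.linalg.norm(b_obs, axis=1, keepdims=True)
        b_pt = (R @ X.T).T + t                             # rays to the 3D points
        b_pt /= np.linalg.norm(b_pt, axis=1, keepdims=True)
        return np.arccos(np.clip(np.sum(b_obs * b_pt, axis=1), -1.0, 1.0))

    def vfc_vectors(X, x, K_virt, R_virt, t_virt):
        """2D vectors obtained by projecting the 3D points into a virtual camera.

        Inlier vectors share a regular orientation trend; outliers are disordered.
        """
        Xc = (R_virt @ X.T).T + t_virt                     # points in virtual frame
        x_virt = (K_virt @ Xc.T).T
        x_virt = x_virt[:, :2] / x_virt[:, 2:3]            # virtual-image pixels
        return x_virt - x                                  # 2D displacement vectors

Under a correct pose, the RBC residuals of inliers are close to zero, while the VFC vectors of inliers vary smoothly across the image.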

 

Fig. 1. Our two consensus constraints for robust camera pose estimation (blue and red represent the inliers and outliers of the observations, respectively): (a) 3D ray bundle consensus; (b) 2D vector field consensus.

 

The details of the algorithm are described in the paper.

The main contributions of the proposed strategy are summarized as follows:

(1) 3D RBC is utilized by the outer module. We define a likelihood based on a probabilistic mixture model and maximize it with the expectation-maximization (EM) algorithm. The inlier probability of each correspondence and the camera pose are determined alternately (see the sketch after this list);

(2) 2D VFC is exploited by the inner module to refine the probabilities obtained by the outer module. The refinement based on the Bayesian rule facilitates the convergence of the outer module and improves the accuracy of the entire framework;

(3) The proposed outlier removal strategy is general. It can be easily integrated into various existing pose estimation methods which are originally vulnerable to outliers.
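As a rough illustration of contribution (1), the sketch below alternates an E-step that updates the inlier probability of each correspondence with an M-step that re-estimates the pose. It assumes a Gaussian model for inlier reprojection errors and a uniform model for outliers; `solve_pose_weighted` and `reproj_errors` are hypothetical placeholders for any weighted pose solver (e.g., a weighted DLT/EPnP) and a reprojection-error routine.

    import numpy as np

    def em_pose(X, x, solve_pose_weighted, reproj_errors,
                n_iters=20, sigma2=4.0, gamma=0.5, outlier_density=1e-5):
        """Alternately estimate inlier probabilities (E-step) and pose (M-step).

        solve_pose_weighted(X, x, w) -> (R, t): pose solver accepting
            per-correspondence weights (hypothetical name).
        reproj_errors(X, x, R, t) -> (N,): squared reprojection errors (px^2).
        """
        w = np.full(len(X), gamma)               # initial inlier probabilities
        R, t = solve_pose_weighted(X, x, w)
        for _ in range(n_iters):
            # E-step: posterior probability that each correspondence is an
            # inlier, under a Gaussian inlier / uniform outlier mixture.
            e2 = reproj_errors(X, x, R, t)
            p_in = gamma * np.exp(-e2 / (2.0 * sigma2)) / (2.0 * np.pi * sigma2)
            p_out = (1.0 - gamma) * outlier_density
            w = p_in / (p_in + p_out)
            # M-step: re-estimate the pose and the mixture parameters.
            R, t = solve_pose_weighted(X, x, w)
            sigma2 = max(np.sum(w * reproj_errors(X, x, R, t))
                         / (2.0 * np.sum(w)), 1e-9)
            gamma = np.mean(w)
        return R, t, w                           # e.g., w > 0.5 taken as inliers

In the full method, the inner module additionally refines the probabilities w with the 2D vector field consensus, via the Bayesian rule, before they are fed back to the outer module.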

3. Experimental Results

To evaluate the proposed outlier removal strategy for camera pose estimation, we have conducted experiments on both synthesized data and real images.

 

3.1 Synthesized Data

We compare our method with existing state-of-the-art approaches in terms of accuracy and efficiency. We denote our full strategy, based on both the ray bundle consensus and the vector field consensus, by RBC-VFC. In addition, the outer module alone, which leverages only the ray bundle consensus, is denoted by RBC and tested independently. We integrate our RBC and RBC-VFC strategies into two popular pose estimation methods, the classical DLT [15] and the widely used EPnP [4]. This integration yields the following two algorithm sets:

(1) DLT, EPnP with RBC: S1={DLT+; EPnP+};
(2) DLT, EPnP with RBC-VFC: S2={DLT++; EPnP++}.

We compare our methods from S1 and S2 with state-of-the-art ones. We consider the following three methods, which are relatively efficient and do not require any pose prior:

(1) Fast outlier removal strategy for EPnP [7], which is denoted as FOR-EPnP;
(2) Two-point localization method based on the toroidal constraint [6], which is denoted as 2P-TC;
(3) Classical RANSAC [11] integrated into EPnP [4], which is denoted as RSC-EPnP. This integration can be regarded as a representative of RANSAC-like methods.
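For reference, RSC-EPnP follows the classical hypothesize-and-verify loop of RANSAC [11]; a minimal sketch (our own, where `epnp` and `reproj_errors` are hypothetical placeholders for a PnP solver such as EPnP [4] and a pixel reprojection-error routine):

    import numpy as np

    def ransac_pnp(X, x, epnp, reproj_errors, thresh=2.0, n_iters=500, seed=0):
        """Classical RANSAC [11] wrapped around a PnP solver such as EPnP [4].

        epnp(X, x) -> (R, t): PnP solver (hypothetical name).
        reproj_errors(X, x, R, t) -> (N,): reprojection errors in pixels.
        """
        rng = np.random.default_rng(seed)
        best_inliers = np.zeros(len(X), dtype=bool)
        for _ in range(n_iters):
            sample = rng.choice(len(X), size=4, replace=False)  # minimal set
            R, t = epnp(X[sample], x[sample])
            inliers = reproj_errors(X, x, R, t) < thresh
            if inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        # Refit on the largest consensus set.
        R, t = epnp(X[best_inliers], x[best_inliers])
        return R, t, best_inliers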

All the above methods are implemented in MATLAB and tested on an Intel Core i7 CPU at 2.40 GHz. In the following, we present comparisons on accuracy and efficiency.

 

3.1.1 Evaluation on Accuracy

We design two groups of experiments with respect to the outlier ratio and the number of 3D-to-2D correspondences. Specifically, for the first group, we fix the number of inliers to 50 and vary the outlier ratio from 10% to 70%; for the second group, we fix the outlier ratio to 50% and vary the total number of correspondences from 10 to 500. We follow the "rotation error" and "translation error" criteria defined in OPnP [17] to quantitatively evaluate the accuracy of the estimated pose.
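For concreteness, one common formulation of these criteria (our own hedged reading of [17]; the exact normalization in the paper may differ) is:

    import numpy as np

    def rotation_error_deg(R_true, R_est):
        # Maximum angle (deg) between corresponding columns of the true and
        # estimated rotation matrices, following the convention of [17].
        cosines = np.clip(np.sum(R_true * R_est, axis=0), -1.0, 1.0)
        return np.degrees(np.max(np.arccos(cosines)))

    def translation_error_percent(t_true, t_est):
        # Relative translation error (%), normalized here by the ground truth.
        return 100.0 * np.linalg.norm(t_true - t_est) / np.linalg.norm(t_true)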

a) Test on the outlier ratio: The first row of Fig. 2 shows the accuracy for an increasing outlier ratio;

b) Test on the number of correspondences: The second row of Fig. 2 reports the accuracy for an increasing number of correspondences.

 

Fig. 2. Accuracy comparison with respect to the outlier ratio (first row) and the number of correspondences (second row). We compare FOR-EPnP [7], 2P-TC [6], and RSC-EPnP [11] with our DLT+, EPnP+, DLT++, and EPnP++.

 

3.1.2 Evaluation on Efficiency

The number of correspondences increases from 100 to 1000 with an outlier ratio of 50%. Fig. 3 presents the computational time of different approaches.

Fig. 3. Computational time with an increasing number of correspondences. We compare FOR-EPnP [7], 2P-TC [6], and RSC-EPnP [11] with our DLT+, EPnP+, DLT++, and EPnP++.

 

3.2. Real Images

We conduct two types of experiments for different purposes: (i) the tests on the EPFL dataset [21] assess the methods on images with large angular disparities; (ii) the tests on the TUM dataset [22] evaluate the approaches on long sequences whose adjacent frames are similar. Specifically, we compare the performance of the original EPnP [4] and its robust versions, including FOR-EPnP [7], RSC-EPnP [11], and our EPnP++.

a) Tests on the EPFL dataset: We evaluate the various methods on the Castle-P19 set of the EPFL dataset [21]. This image set is composed of 19 images of 3072×2048 pixels, with ground-truth poses provided. The repetitive patterns and large angular disparities of these images are prone to cause mismatches. We randomly select an image as the query image (shown in Fig. 4) and estimate its pose with the various methods. Besides the quantitative criteria "rotation error" Erot and "translation error" Etrans [17] for accuracy evaluation, we also provide a visual-alignment evaluation (readers are invited to refer to the paper for details).
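The visual-alignment evaluation amounts to projecting the 3D contour points with the estimated pose and overlaying the projections on the manually drawn 2D contour; a minimal sketch (our own, assuming a pinhole camera without lens distortion):

    import numpy as np

    def project(X, K, R, t):
        # Project 3D contour points X (N, 3) into the image with the estimated
        # pose (R, t) and intrinsics K; overlay the result on the manual contour.
        Xc = (R @ X.T).T + t
        x = (K @ Xc.T).T
        return x[:, :2] / x[:, 2:3]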

 

Fig. 4. Experimental results on the Castle-P19 set of the EPFL dataset [21]. The pose estimation results for the query image (with embedded contour) are obtained by EPnP [4], FOR-EPnP [7], RSC-EPnP [11], and our EPnP++. The pair of numbers below each image gives the rotation error Erot and the translation error Etrans. Cyan contours are the manually constructed 2D contours; the white contours are the 3D contours projected with the estimated poses. A better alignment between the cyan and white contours indicates a more accurate pose.

 

b) Tests on the TUM dataset: Fig. 5 shows the experimental results on the sequence fr3/structure_texture_far of the TUM dataset [22]. This sequence is captured by a hand-held camera moving along a zig-zag structure and is composed of 938 images of 640×480 pixels. The high similarity between adjacent frames contributes to correct matching. We evaluate the accuracy of the estimated trajectories by their deviations from the ground-truth trajectory (5.88 m long). Note that we assess the raw estimated camera positions (deliberately without bundle adjustment [15]) for a fair, unbiased comparison of the original accuracy of the pose estimation algorithms.
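The reported deviation can be read as the root-mean-square difference between estimated and ground-truth camera positions at corresponding frames (our simplified interpretation; the TUM benchmark's absolute trajectory error [22] additionally aligns the two trajectories before comparison):

    import numpy as np

    def trajectory_rmse(p_est, p_gt):
        # Root-mean-square deviation between estimated and ground-truth camera
        # positions (both (N, 3)) at corresponding frames.
        return np.sqrt(np.mean(np.sum((p_est - p_gt) ** 2, axis=1)))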

 

Fig. 5. Test results on the sequence fr3/structure_texture_far of the TUM dataset [22]. (a) Top view of the estimated trajectories obtained by FOR-EPnP [7], RSC-EPnP [11], and our EPnP++; the black line denotes the ground-truth trajectory. (b) 3D-to-2D point correspondences used to determine the camera pose of a randomly selected frame; the inliers (blue) and outliers (red) are identified by our EPnP++.

 

References

[1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Transactions on Robotics, 2015.
[4] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, 2008.
[6] F. Camposeco, T. Sattler, A. Cohen, A. Geiger, and M. Pollefeys, “Toroidal constraints for two-point localization under high outlier ratios,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[7] L. Ferraz, X. Binefa, and F. Moreno-Noguer, “Very fast solution to the PnP problem with algebraic outlier rejection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[11] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, 1981.
[15] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, second edition, 2003.
[17] Y. Zheng, Y. Kuang, S. Sugimoto, K. Åström, and M. Okutomi, “Revisiting the PnP problem: A fast, general and optimal solution,” in IEEE International Conference on Computer Vision, 2013.
[21] C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen, “On benchmarking camera calibration and multi-view stereo for high resolution imagery,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[22] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.


Datasets and Code

1. EPFL Dataset: http://icwww.epfl.ch/~marquez/multiview/denseMVS.html

2. TUM Dataset: http://vision.in.tum.de/data/datasets/rgbd-dataset

3. Source Code of the Proposed Methods: Source Code

 


Citation

1. Haoang Li, Ji Zhao, Jean-Charles Bazin, Lei Luo, and Jian Yao, “Robust Camera Pose Estimation via Consensus on Ray Bundle and Vector Field,” submitted to the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018), March 2018.