In this paper, we propose a unified model for camera pose estimation and a novel strategy for pose optimization that combine points and lines in monocular visual odometry. The unified model treats point and line features equivalently. It applies to all minimal cases, which require a minimum of three point and/or line features, and it extends easily to cases with additional observations. The core idea is to directly retrieve all stationary points of a cost function from its first-order optimality condition, without initialization or iteration. The estimated pose is reliable thanks to robust geometric constraints and a reliable algebraic solver. To refine the camera pose, we propose a novel optimization strategy that minimizes the unconstrained Sampson error and takes the specific uncertainty of each feature into account, so that noise is penalized more reasonably. Moreover, it is simpler than conventional bundle adjustment because it avoids a high-dimensional parameter search. Experimental results on simulated data and real images demonstrate that the proposed camera pose estimation and optimization methods outperform state-of-the-art monocular algorithms.
First, we propose a unified model that exploits point and line features simultaneously. It has three main advantages:
(1) high universality across various environments, including extreme scenes with scarce features;
(2) high efficiency, with O(n) complexity that can be further reduced to O(1);
(3) high accuracy without any initialization or iteration, thanks to robust geometric constraints and a reliable algebraic solution.
Second, we present a novel optimization strategy that minimizes the Sampson error to refine the camera pose. It has two main strengths:
(1) each observation is assigned a specific weight, so that noise is penalized more reasonably;
(2) it has a lower complexity than conventional bundle adjustment (BA), which minimizes the re-projection error.
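The weighting idea in item (1) can be illustrated with a generic first-order (Sampson-style) error for a single scalar constraint, normalized by the measurement variances. This is a textbook form given only as a sketch, not the formulation used in the paper: the function names, the assumption of independent per-coordinate noise, and the point-on-line example constraint are all introduced here for illustration.

```python
def sampson_error(residual, gradient, variances):
    """First-order (Sampson-style) error of one scalar constraint.

    residual  -- value of the constraint f(x) at the noisy measurement x
    gradient  -- partial derivatives of f w.r.t. each measured coordinate
    variances -- noise variance of each measured coordinate
                 (assumed independent; this is an illustrative assumption)
    """
    denom = sum(g * g * v for g, v in zip(gradient, variances))
    if denom == 0.0:
        raise ValueError("gradient or variances are all zero")
    return residual * residual / denom


def point_on_line_error(line, point, sigma):
    """Hypothetical example constraint: point (u, v) lies on a*u + b*v + c = 0."""
    a, b, c = line
    u, v = point
    residual = a * u + b * v + c
    # The gradient of the constraint w.r.t. (u, v) is (a, b).
    return sampson_error(residual, (a, b), (sigma ** 2, sigma ** 2))


# The same offset from the line v = 2 is penalized less for a noisier
# feature (sigma = 2) than for a more precise one (sigma = 1).
precise = point_on_line_error((0.0, 1.0, -2.0), (5.0, 5.0), sigma=1.0)
noisy = point_on_line_error((0.0, 1.0, -2.0), (5.0, 5.0), sigma=2.0)
```

Dividing by the propagated variance is what lets a per-feature uncertainty act as a per-feature weight in the total cost.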
The basic geometric model used in this paper is shown in Fig. 1. The details of the algorithm are described in the paper.
To evaluate both the proposed unified model for camera pose estimation (PLUM) and the proposed optimization strategy (SPOS), we used simulated data and real images and compared accuracy and efficiency against the following state-of-the-art methods:
• Point-based methods: EPnP, OPnP.
• Line-based methods: AlgLS, RPnL.
• Combined point-and-line method: DLT.
All tested approaches were implemented in MATLAB, except for SPOS, which was implemented in C++. All experiments were run on an Intel Core i7 CPU at 2.40 GHz.
The point-based and line-based approaches used m points and n lines, respectively, and we set m = n for fairness. For the joint-feature methods DLT and PLUM, we considered two settings: (1) m points plus n lines, to verify the advantage of approaches that can handle both feature types simultaneously; (2) m/2 points plus n/2 lines, to keep the total number of features equal to that of the single-feature methods. PLUM+SPOS was initialized with the poses obtained from PLUM using m points plus n lines.
We ran two groups of experiments to assess the accuracy of the proposed algorithms with respect to noise and to the number of features, respectively. Fig. 2(a) shows the results of the first group, with respect to noise. Noise with different standard deviations was added to the image points and to the endpoints of the image line segments. We fixed the number of features at m = 6 for the point-based approaches and n = 6 for the line-based ones, and compared them with the combined point-and-line methods using either m = 6 plus n = 6 or m/2 = 3 plus n/2 = 3 features.
Fig. 2(b) shows the results of the second group of experiments, with respect to the number of features. We fixed the standard deviation of the noise at 2 pixels while increasing the numbers m and n of features.
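The image perturbation used in both groups could be sketched as follows. This is only an illustration: the source does not state the noise distribution, so zero-mean Gaussian noise applied independently to each coordinate is an assumption here, and the function names `add_pixel_noise` and `add_segment_noise` are hypothetical.

```python
import random


def add_pixel_noise(points, sigma, rng=None):
    """Return a copy of 2D image points perturbed by zero-mean Gaussian noise.

    points -- list of (u, v) pixel coordinates
    sigma  -- standard deviation of the noise in pixels
    rng    -- optional random.Random instance for reproducibility
    """
    if rng is None:
        rng = random.Random()
    return [(u + rng.gauss(0.0, sigma), v + rng.gauss(0.0, sigma))
            for u, v in points]


def add_segment_noise(segments, sigma, rng=None):
    """Perturb both endpoints of each 2D line segment independently."""
    if rng is None:
        rng = random.Random()
    return [tuple(add_pixel_noise(list(segment), sigma, rng))
            for segment in segments]


# Example with a fixed seed and the 2-pixel level used in the second group.
rng = random.Random(0)
points = [(100.0, 120.0), (320.0, 240.0), (500.0, 60.0)]
segments = [((10.0, 10.0), (200.0, 30.0))]
noisy_points = add_pixel_noise(points, sigma=2.0, rng=rng)
noisy_segments = add_segment_noise(segments, sigma=2.0, rng=rng)
```

Passing an explicit `random.Random` instance keeps each trial reproducible, which matters when averaging errors over many noisy trials.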
Subsequently, we evaluated the efficiency of the joint-feature algorithms PLUM and DLT. The original overall complexity of PLUM is O(n); we integrated a simple vectorization technique into PLUM to further reduce it to O(1).
Fig. 3 shows the average computation times after optimization as the numbers of points m and lines n increase. The proposed geometric constraint for lines is more complicated than the one for points, so PLUM takes more time when it uses lines. In contrast, the cost of DLT increases sharply when the number of features becomes large.
Finally, we compared SPOS with traditional bundle adjustment (BA) solved by Levenberg-Marquardt (LM), denoted LMBA, where LM was implemented with the Ceres Solver library. We initialized both methods with the output of PLUM and added noise with a standard deviation fixed at 3 pixels. The methods were compared in terms of the rotation error Erot, the translation error Etrans, the total time T, and the number of iterations Niter until convergence. The results for different combinations of m points and n lines are shown in Table I.
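The exact definitions of Erot and Etrans are given in the paper rather than here. As a rough sketch, the following uses definitions that are common in the pose estimation literature and are assumed for illustration only: the angle of the relative rotation, in degrees, and the translation difference relative to the ground-truth norm, in percent. The function names are hypothetical.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation R_gt^T * R_est.

    Both arguments are 3x3 rotation matrices given as nested lists.
    """
    # trace(R_gt^T * R_est) equals the sum of element-wise products.
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_percent(t_est, t_gt):
    """Norm of the translation difference relative to ||t_gt||, in percent."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))
    norm = math.sqrt(sum(b * b for b in t_gt))
    if norm == 0.0:
        raise ValueError("ground-truth translation has zero norm")
    return 100.0 * diff / norm


# Example: an estimate rotated by 90 degrees about the z-axis
# and offset by 0.1 along z from a unit-length ground-truth translation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
e_rot = rotation_error_deg(rot_z_90, identity)
e_trans = translation_error_percent((1.0, 0.0, 0.1), (1.0, 0.0, 0.0))
```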
For the EPFL dataset, we first matched points and lines between several images and then reconstructed the 3D structure by triangulation, using the known poses of those images. To recover the unknown pose of a new image, we matched this image against the reconstructed model to obtain 3D-to-2D correspondences. Using RANSAC to remove outliers, we estimated the camera poses with UMp, UMl, UMp,l and UMOSp,l. We then manually chose a set of representative 3D contours and long line segments that sketch the structure of the model and projected them onto the image plane with the calculated poses, so that the quality of the poses can be assessed visually. Some typical results from Fountain-P11 and Castle-P19 are shown in Fig. 4.
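The visual check described above amounts to projecting 3D points and segment endpoints into the image with an estimated pose. A minimal sketch under a standard pinhole model follows. The intrinsics layout (fx, fy, cx, cy) with zero skew and no lens distortion, and the function names, are assumptions made for this illustration.

```python
def project_point(K, R, t, X):
    """Project a 3D point X into the image using x ~ K (R X + t).

    K -- intrinsics as (fx, fy, cx, cy); zero skew is assumed
    R -- 3x3 rotation matrix (nested lists), world to camera
    t -- translation vector of length 3, world to camera
    X -- 3D point in world coordinates
    """
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0.0:
        raise ValueError("point is not in front of the camera")
    fx, fy, cx, cy = K
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def project_segment(K, R, t, segment):
    """Project both endpoints of a 3D line segment."""
    return tuple(project_point(K, R, t, P) for P in segment)


# Example with an identity pose and hypothetical intrinsics.
K = (800.0, 800.0, 320.0, 240.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
pixel = project_point(K, R, t, (1.0, 2.0, 4.0))
segment_2d = project_segment(K, R, t, ((1.0, 2.0, 4.0), (-1.0, 0.0, 2.0)))
```

Drawing the projected segments over the image makes pose errors visible as misalignment with the corresponding image edges.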
We then tested our methods on the KITTI benchmark, which provides accurate ground truth (GT). We followed the standard monocular visual odometry pipeline: matching points and lines, and recovering the poses of the moving car with a single camera while constructing and updating the 3D structure of the environment.
Note that a single camera, unlike a stereo rig, has no inter-camera distance to serve as an anchor, so drift (especially scale drift) is liable to occur over time. To evaluate the raw pose estimation results, we did not apply any optimization or loop correction. We compared the trajectories estimated by UMp, UMl and UMp,l on several sequences. In the vast majority of cases, UMp,l is more accurate than UMp, while UMl often failed to track in areas with a lot of foliage, where lines are difficult to extract and match. Typical experimental results on Sequence-07, which contains a loop, are shown in Fig. 5(a).
In the scene shown in Fig. 5(b), extracted from this sequence, UMp tracks successfully, but the insufficient number of points yields an unstable pose, which leads to drift. UMp,l alleviates the drift to some extent because the two feature constraints are used together, which illustrates the advantage of using points and lines simultaneously in our proposed algorithm.
1. Fountain-P11 and Castle-P19 from the EPFL dataset: http://cvlabwww.epfl.ch/data/multiview/denseMVS.html
2. Sequence-07 from the KITTI benchmark: http://www.cvlibs.net/datasets/kitti/eval_odometry.php
3. Code for the proposed unified model: MATLAB Source Code
4. Code for the proposed optimization strategy: C++ Source Code
1. Haoang Li, Jian Yao*, Xiaohu Lu and Junlin Wu, Combining Points and Lines for Camera Pose Estimation and Optimization in Monocular Visual Odometry, submitted to the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017), March 2017 [PDF].