Combining Points and Lines for Camera Pose Estimation and Optimization in Monocular Visual Odometry


Haoang Li, Jian Yao*, Xiaohu Lu and Junlin Wu

School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, Hubei, P. R. China

*E-mail: jian.yao@whu.edu.cn

*Web: http://www.scholat.com/jianyao http://cvrs.whu.edu.cn


1. Abstract

In this paper, we propose a unified model for camera pose estimation and a novel strategy for pose optimization, combining points and lines in monocular visual odometry. The proposed unified model treats point and line features equivalently; it applies to all minimal cases requiring only three point and/or line features, and it is easily extended to various circumstances with additional observations. The core idea is to directly retrieve all stationary points of a cost function via the first-order optimality condition, without initialization or iteration. The estimated pose is reliable thanks to robust geometric constraints and a reliable algebraic solver. To refine the camera pose, we propose a novel optimization strategy that minimizes the unconstrained Sampson error, taking the specific uncertainty of each feature into account to penalize noise more reasonably. Moreover, it is simpler than conventional bundle adjustment because it avoids a high-dimensional parameter search. Experimental results on simulated data and real images demonstrate the superiority of the proposed camera pose estimation and optimization methods compared with state-of-the-art monocular algorithms.

2. Approach

We first propose a unified model that exploits point and line features simultaneously, which has three main advantages:

(1) high universality across various environments, even extreme scenes with scarce features;

(2) high efficiency, with O(n) complexity that can even be reduced to O(1);

(3) high accuracy without any initialization or iteration, owing to robust geometric constraints and a reliable algebraic solution.


Moreover, a novel optimization strategy that minimizes the Sampson error is presented to refine the camera pose, which has two main strengths:

(1) Each observation is assigned a specific weight to penalize noise more reasonably (a generic sketch of such a weighted Sampson cost follows this list);

(2) Our proposed optimization strategy has a lower complexity than conventional BA, which minimizes the re-projection error.
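To make the weighting concrete, a generic covariance-weighted Sampson cost for an implicit constraint eps(x) = 0 can be evaluated as below. This is only an illustrative C++/Eigen sketch: the paper's actual constraint functions, Jacobians, and per-feature uncertainties are defined in the paper itself.

#include <Eigen/Dense>

// Sampson (first-order geometric) cost for an implicit constraint
// eps(x) = 0, with J = d(eps)/dx evaluated at the measurement x and
// Sigma the per-feature measurement covariance. A larger propagated
// covariance J * Sigma * J^T down-weights that feature's residual.
double sampsonCost(const Eigen::VectorXd& eps,
                   const Eigen::MatrixXd& J,
                   const Eigen::MatrixXd& Sigma) {
  const Eigen::MatrixXd W = J * Sigma * J.transpose();
  return eps.dot(W.ldlt().solve(eps));  // eps^T * W^{-1} * eps
}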

 

Fig. 1. An illustration of the geometric model for our proposed unified camera pose estimation, using m points and n lines equivalently.


The basic geometric model used in this paper is shown in Fig. 1. The details of the algorithm are described in the paper.
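As a rough illustration of the two constraint types in Fig. 1 (a sketch under assumed pinhole geometry with known intrinsics K, not the paper's exact formulation): a 3D point X should project onto its observed pixel x, and any 3D point Y on a 3D line should project onto the observed 2D line l.

#include <Eigen/Dense>

// Point constraint: x ~ K * (R * X + t), i.e. the reprojection residual
// between the projected point and the observed pixel should vanish.
Eigen::Vector2d pointResidual(const Eigen::Matrix3d& K, const Eigen::Matrix3d& R,
                              const Eigen::Vector3d& t, const Eigen::Vector3d& X,
                              const Eigen::Vector2d& x) {
  return (K * (R * X + t)).hnormalized() - x;
}

// Line constraint: l^T * K * (R * Y + t) = 0 for any 3D point Y on the
// line, where l holds the homogeneous coefficients of the observed 2D line.
double lineResidual(const Eigen::Matrix3d& K, const Eigen::Matrix3d& R,
                    const Eigen::Vector3d& t, const Eigen::Vector3d& Y,
                    const Eigen::Vector3d& l) {
  return l.dot(K * (R * Y + t));
}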

3. Experimental Results

To thoroughly demonstrate the performance of both the proposed unified model for camera pose estimation and the proposed optimization strategy, we used simulated data and real images to compare with state-of-the-art methods in terms of accuracy and efficiency.

3.1. Simulated Data

To evaluate the accuracy of the estimated pose R and t, we computed the absolute rotation error Erot and the relative translation error Etrans. All experimental results are averages over 1000 independent trials.
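For reference, one common convention for these two metrics (which we assume here; the paper may differ in details) is sketched below.

#include <Eigen/Dense>
#include <algorithm>
#include <cmath>

// Absolute rotation error in degrees: the angle of the relative rotation
// R_est^T * R_gt, recovered from its trace.
double rotationErrorDeg(const Eigen::Matrix3d& R_est, const Eigen::Matrix3d& R_gt) {
  double c = ((R_est.transpose() * R_gt).trace() - 1.0) / 2.0;
  c = std::max(-1.0, std::min(1.0, c));  // clamp against numerical drift
  return std::acos(c) * 180.0 / 3.14159265358979323846;
}

// Relative translation error in percent.
double translationErrorPct(const Eigen::Vector3d& t_est, const Eigen::Vector3d& t_gt) {
  return 100.0 * (t_est - t_gt).norm() / t_gt.norm();
}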

In the following, the proposed unified model for camera pose estimation is denoted PLUM, and the Sampson-error-minimization-based optimization strategy is denoted SPOS. The case where PLUM is refined by SPOS is denoted PLUM+SPOS. We compared PLUM and PLUM+SPOS against the following state-of-the-art camera pose estimation algorithms in MVO:

• Point-based methods: EPnP, OPnP.
• Line-based methods: AlgLS, RPnL.
• Combined point-and-line method: DLT.

Note that all tested approaches were implemented in MATLAB, except for SPOS, which was implemented in C++; all experiments ran on an Intel Core i7 CPU at 2.40 GHz.

The point-based and line-based approaches used m points and n lines, respectively, and we set m = n for fairness. It is worth emphasizing that for the joint-feature methods DLT and PLUM, we considered two situations: (1) m points plus n lines (to verify the advantage of approaches that handle both feature types simultaneously); (2) m/2 points plus n/2 lines (to guarantee fairness against the single-feature methods in the total number of features). PLUM+SPOS was initialized with the poses obtained by PLUM using m points plus n lines.

Fig. 2. Comparative results on simulated data in terms of accuracy: (a) with respect to noise; (b) with respect to the number of features.

 

We performed two groups of experiments to assess the accuracy of our algorithms with respect to noise and to the number of features, respectively. Fig. 2(a) shows the results of the first group, with respect to noise. Noise with different standard deviations was added to the points or to the endpoints of line segments in the image. We fixed the number of features at m = 6 for the point-based approaches and n = 6 for the line-based ones, and compared against the combined point-and-line methods with m = 6 plus n = 6 or m/2 = 3 plus n/2 = 3.

Fig. 2(b) shows the results of the second group, with respect to the number of features. We fixed the standard deviation of the noise at 2 pixels while increasing the numbers of features m and n.

 

Fig. 3. Running times for different numbers of correspondences. The blue, red, and black lines represent the running times of PLUM using points, lines, and the combination of both features, respectively, after applying the vectorization technique. The performance of DLT is drawn in cyan.

 

Subsequently, we evaluated the efficiency of the joint-feature algorithms PLUM and DLT. The original overall complexity of PLUM is O(n); a simple vectorization technique was integrated with PLUM to further reduce the complexity to O(1).
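The actual vectorization technique is described in the paper; as a hedged illustration of how such reductions are typically obtained, per-feature work can be folded into fixed-size accumulated matrices, so that the solve step no longer depends on the number of features.

#include <Eigen/Dense>
#include <vector>

// Illustrative only: if each feature i contributes a small linear constraint
// A_i * theta = b_i on the pose parameters theta, the normal equations can be
// accumulated into fixed d x d blocks in one pass; the final solve then costs
// O(d^3), independent of the number of features n.
Eigen::VectorXd solveAccumulated(const std::vector<Eigen::MatrixXd>& A,
                                 const std::vector<Eigen::VectorXd>& b) {
  const int d = static_cast<int>(A.front().cols());
  Eigen::MatrixXd M = Eigen::MatrixXd::Zero(d, d);
  Eigen::VectorXd v = Eigen::VectorXd::Zero(d);
  for (std::size_t i = 0; i < A.size(); ++i) {  // single accumulation pass
    M += A[i].transpose() * A[i];
    v += A[i].transpose() * b[i];
  }
  return M.ldlt().solve(v);  // cost depends only on d, not on n
}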

Fig. 3 shows the average computation times after this optimization as the numbers of points m and lines n increase. The proposed line geometric constraint is more complicated than the point constraint; accordingly, PLUM needs more time when using lines. In contrast, the cost of DLT increases dramatically when the number of features becomes large.

 

Finally, we conducted a comparative experiment between SPOS and traditional bundle adjustment (BA) solved by Levenberg-Marquardt (LM), denoted LMBA, where LM was implemented on top of the Ceres Solver library. We initialized both methods with the output of PLUM and added noise with a standard deviation fixed at 3 pixels. They were compared in terms of Erot, Etrans, the total running time T, and the number of iterations Niter until convergence. The results for different combinations of m points and n lines are shown in Table I.
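For readers unfamiliar with the baseline, a minimal sketch of how an LM-based reprojection-error BA can be set up with Ceres Solver follows. The residual parameterization (angle-axis pose, normalized image coordinates) is our assumption for illustration, not the exact configuration used in the experiments.

#include <ceres/ceres.h>
#include <ceres/rotation.h>

// Reprojection residual for one 3D-2D point correspondence, in normalized
// image coordinates (an assumption made for this sketch).
struct ReprojectionError {
  ReprojectionError(double u, double v) : u(u), v(v) {}
  template <typename T>
  bool operator()(const T* const pose,   // [0..2] angle-axis, [3..5] translation
                  const T* const point,  // 3D point
                  T* residuals) const {
    T p[3];
    ceres::AngleAxisRotatePoint(pose, point, p);
    p[0] += pose[3]; p[1] += pose[4]; p[2] += pose[5];
    residuals[0] = p[0] / p[2] - T(u);
    residuals[1] = p[1] / p[2] - T(v);
    return true;
  }
  double u, v;
};

int main() {
  double pose[6] = {0, 0, 0, 0, 0, 0};  // initialized from PLUM in practice
  double point[3] = {0.1, -0.2, 4.0};   // one reconstructed 3D point
  ceres::Problem problem;
  problem.AddResidualBlock(
      new ceres::AutoDiffCostFunction<ReprojectionError, 2, 6, 3>(
          new ReprojectionError(0.025, -0.05)),  // observed pixel (normalized)
      nullptr, pose, point);
  ceres::Solver::Options options;
  options.trust_region_strategy_type = ceres::LEVENBERG_MARQUARDT;
  ceres::Solver::Summary summary;
  ceres::Solve(options, &problem, &summary);
  return 0;
}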

3.2. Real Images

We also tested our algorithms on real images from the EPFL dataset and the KITTI benchmark. These images contain enough structured objects, with a large number of points and lines that can be exploited by joint-feature algorithms. We conducted two sets of experiments with distinct purposes: the first used the EPFL dataset, focusing on performance for image pairs with large angular disparity; the second used consecutive image sequences from the KITTI benchmark, in which adjacent frames are highly similar, to test our algorithms over long distances and at large scale. There are four cases in total for estimating camera poses with the proposed unified model: (1) using only points (UMp); (2) using only lines (UMl); (3) using points plus lines (UMp,l); (4) using points plus lines followed by our optimization strategy (UMOSp,l).

 

Fig. 4. Illustrations on the EPFL Fountain-P11 dataset (first row) and the Castle-P19 dataset (second row). The first three columns show the results using only points, only lines, and joint features, respectively. The last column reports the optimized results initialized with the result of the third column. The numbers at the bottom of each image give the rotation and translation errors as a pair ((deg), (%)). Cyan patterns are the manually selected 3D line segments and contours used for building correspondences. Using the estimated poses, these 3D elements were projected onto the image planes, marked in white.

 

For the EPFL dataset, we first matched points and lines between several images, and then reconstructed 3D structures by triangulation using the poses of those images, which are known in advance. To recover the unknown pose of a new image, we matched it against the reconstructed model to obtain 3D-to-2D correspondences. Integrating RANSAC to remove outliers (see the schematic sketch below), we estimated the camera poses with UMp, UMl, UMp,l, and UMOSp,l. Afterwards, a set of representative 3D contours and long line segments sketching the structure of the scene model was chosen manually and back-projected onto the image plane with the calculated poses, so that the quality can be judged by visual inspection. Typical results from Fountain-P11 and Castle-P19 are shown in Fig. 4.
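The outlier-removal step follows the standard hypothesize-and-verify pattern. In the sketch below, solveMinimal and residual are hypothetical placeholders standing in for the paper's 3-feature unified solver and its point/line residuals.

#include <Eigen/Dense>
#include <functional>
#include <random>
#include <vector>

struct Pose { Eigen::Matrix3d R; Eigen::Vector3d t; };

// Schematic RANSAC: repeatedly fit a pose from a minimal 3-feature sample
// and keep the hypothesis with the most inliers. solveMinimal and residual
// are placeholders to be supplied by the caller.
Pose ransacPose(int numFeatures, int iterations, double threshold,
                const std::function<Pose(const std::vector<int>&)>& solveMinimal,
                const std::function<double(const Pose&, int)>& residual) {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> pick(0, numFeatures - 1);
  Pose best;
  int bestInliers = -1;
  for (int it = 0; it < iterations; ++it) {
    const std::vector<int> sample = {pick(rng), pick(rng), pick(rng)};
    const Pose hyp = solveMinimal(sample);
    int inliers = 0;
    for (int i = 0; i < numFeatures; ++i)
      if (residual(hyp, i) < threshold) ++inliers;
    if (inliers > bestInliers) { bestInliers = inliers; best = hyp; }
  }
  return best;  // in practice, refit on the inlier set afterwards
}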

 

Fig. 5. Results on Sequence-07 from the KITTI benchmark: (a) top view of the trajectories, where the black line represents the ground truth, the blue line the trajectory recovered with only points, and the red line the trajectory recovered with the combination of points and lines; (b) a selected image of a bend with few point features (line inliers marked in green and outliers in red).

 

We then tested our methods on the KITTI benchmark, which provides an accurate ground truth (GT). We followed the standard monocular visual odometry pipeline: matching points and lines, and recovering the poses of the moving car with a single camera while constructing and updating the 3D structure of the environment.

Note that because a single camera, unlike a stereo rig, has no inter-camera baseline to serve as an anchor, drift (especially in scale) tends to accumulate over time. To evaluate the raw pose estimation results, we did not apply any optimization or loop correction. We compared the trajectories estimated by UMp, UMl, and UMp,l on several sequences. In the vast majority of cases, UMp,l is more accurate than UMp, while UMl often failed to track in areas with abundant foliage, where lines are difficult to extract and match. Typical experimental results on Sequence-07, which contains a loop, are shown in Fig. 5(a).

In the scene shown in Fig. 5(b), extracted from this sequence, UMp tracks successfully, but with insufficient points the estimated pose is unstable, leading to drift. UMp,l alleviates the drift to some extent because the two feature constraints are used together, demonstrating the advantage of our algorithm using points and lines simultaneously.

 


Dataset and Codes

1. Fountain-P11 and Castle-P19 from the EPFL dataset: http://cvlabwww.epfl.ch/data/multiview/denseMVS.html

2. Sequence-07 from the KITTI benchmark: http://www.cvlibs.net/datasets/kitti/eval_odometry.php

3. Source code of the proposed unified model: MATLAB Source Code

4. Source code of the proposed optimization strategy: C++ Source Code

 


Citation

1. Haoang Li, Jian Yao*, Xiaohu Lu, and Junlin Wu, "Combining Points and Lines for Camera Pose Estimation and Optimization in Monocular Visual Odometry," submitted to the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2017), March 2017 [PDF].