
Hand Detection and Tracking Using Depth and Color Information
Minsun Park, Md. Mehedi Hasan, Jaemyun Kim and Oksam Chae
Department of Computer Engineering, Kyung Hee University,
446-701, Seocheon-dong, Yongin-si, Gyeonggi-do, Republic of Korea
{ romana2ms, mehedi, sense21c, oschae } @ khu.ac.kr

Abstract - The detection and tracking of a hand is an emerging research issue for controlling devices by hand motion. Conventional hand detection methods use color and shape information from an RGB camera. With the recent advent of the depth camera, some researchers have shown that the performance of hand detection can be improved by combining the color (or intensity) information with the information from the depth camera. In this paper, we propose a novel method for hand detection using both color and depth information from Microsoft's Kinect device. The proposed method extracts candidate hand regions from the depth image and selects the best candidate based on the color and shape features of each candidate region. The contour of the selected candidate is then determined in the higher resolution RGB image to improve the positional accuracy. For tracking the detected hand, we propose a boundary tracking method based on the Generalized Hough Transform (GHT). The experimental results show that the proposed method improves the accuracy of hand motion detection over conventional methods.
Keywords: Hand detection, Depth and color, Kinect,
Histogram, Tracking.

Introduction

The advent of relatively low resolution image and depth sensors has spurred research in the field of object tracking and gesture recognition. Building an interface that controls a device by detecting and tracking the gestures of an object is an emerging research issue. Among the variety of motions that can be used to interface with devices, those of the hand are the most convenient and have been widely utilized. For this type of research, Microsoft Kinect is one of the most popular devices; its sensors capture both RGB and depth information. Many experiments have been carried out to detect the hand based on skin color or on depth information from Kinect. Detecting the location of the hand is the initial step in detecting and tracking hand gestures. It is more challenging than face or body recognition because the depth images of Kinect have very low resolution, which makes it hard to detect and track small objects that cover a smaller area than the background. The most common way to detect the hand is to threshold the depth information, although choosing an adaptive threshold that selects the hand regions is itself a challenge. Thresholding involves cropping out those pixels whose z-value (depth) deviates too far from the estimated depth. This works well for more expensive cameras with high spatial and depth resolution (images with dimensions on the order of 100s to 1000s of pixels, resolution on the order of a few millimeters [1]). Kinect images, however, have dimensions on the order of 10s of pixels, and the depth images provided by the sensor have a nominal accuracy of 3 mm. The depth information provided by the sensor is also susceptible to infrared occlusion and other noise effects.
Skin color-based hand detection [2] has the advantage of making hand region detection relatively easy because it uses color information. For the same reason, it is difficult to differentiate hands from overlapping hands or from objects with a similar skin color. Color is also sensitive to illumination variations and noise, which is another drawback. The depth image can overcome these drawbacks: an object can be distinguished from the background much more easily using distance information, which is less sensitive to lighting changes and complicated backgrounds. However, it is still hard to distinguish different objects at the same distance and to extract the detailed shape of an object because of the low resolution of the depth image. Moreover, the use of an infrared camera adds considerable noise and generates occlusion, which requires additional work for compensation. To overcome these obstacles, we incorporate RGB information in addition to depth information to enhance our estimation of hand locations.
In this paper, we propose a novel method of hand detection and tracking that detects the hand faster and more accurately than conventional methods. We use Kinect to obtain color and depth images at the same time. Our method first calculates a histogram of the depth image according to distance information, analyzes the histogram to find appropriate candidate values, and generates a criterion function to extract hand regions from the image. The proposed method can extract hand regions adaptively without the preprocessing used by conventional methods.

Related Works

In recent years, methods that recognize hands precisely by using color and depth information together, each compensating for the shortcomings of the other, have been studied. In [3], a method of hand detection using a combination of stereo and RGB cameras is introduced, and a method based on a combination of ToF and RGB cameras is presented in [4]. In general, the conventional method first finds the face or body, which is easier than detecting hands. Then, based on the distance data from the camera, a threshold is defined to separate the hands from the body. In this approach, the role of the color information obtained from a general camera is to select candidates by skin color and to compensate for the shortcomings of the distance information. In [5], a method of hand detection that calculates the grayscale histogram of a range image is introduced; it defines the noise as a threshold in the image to locate humans. A hand detection method based on the user's face color, which flexibly adjusts the range of skin colors, is introduced in [4]. As mentioned earlier, the existing types of depth camera are the stereo camera and the ToF (Time of Flight) camera. A stereo camera consists of two general cameras and obtains depth information from this structure, but it is not suitable for tracking in a real-time system because of installation and calibration problems. A ToF camera obtains distance information directly by using infrared light, but its development has been limited by the higher price of the camera. In this paper, we use Microsoft's Kinect, which has pioneered this area since its release. Kinect obtains color and depth information at the same time and also provides a basic library that makes problems such as calibration easy to solve. The device has therefore been welcomed by many developers and researchers interested in this area, especially because of its low cost and easy installation. In [6], hand detection is performed using functions provided by OpenNI, and [7] introduces a method of hand detection based on skin detection, followed by estimation of the hand position relative to the human body. Figure 1 shows the conventional approach to hand detection. It begins by finding the face or body, which is easier than detecting the hands first because the human face and body have more distinguishable features, but it requires considerably more time and computational effort for preprocessing and for detecting the face or body.
[Figure 1: block diagram with the blocks RGB Information, Human Body/Face Detection, Skin Detection, Hand Detection, ToF (Distance Information), Distance Calculation, Threshold, and Tracking/Gesture Recognition]

Figure 1. Conventional hand detection approach

Proposed Method

In this paper we propose an adaptive hand detection approach that uses 3-dimensional information from Kinect and tracks the hand with a GHT-based method. Figure 2 illustrates the system overview of the proposed method. Because color and depth information are used simultaneously, synchronization and registration between the color and depth images obtained from the Kinect must be considered. To summarize the proposed algorithm briefly, we first detect candidate hand regions from the histogram of the depth image and rank each candidate region using color information to reduce the number of candidates. We then obtain the boundary of the hand to determine its exact position. The depth image includes many unwanted portions in the hand regions because of noise and low resolution, so we use color information to compensate for this disadvantage of the depth image and to improve the accuracy of the extracted hand contour. Finally, we perform the tracking using an edge segment based tracking algorithm.
[Figure 2: block diagram with the blocks Microsoft Kinect, Low Resolution Depth Image, High Resolution RGB Image, Candidate Region Selection Using Depth Thresholding, Hand Region Selection Using Color and Shape, Hand Contour Search Region Tracing, Accurate Boundary Extraction, and Track Hand Using GHT]

Figure 2. Overview of the proposed system

3.1 Candidate Hand Region Selection from Depth Image

To use color and depth information at the same time for detecting and tracking moving hands, an initialization step must be considered. The first consideration is the synchronization between the color and depth images. Measuring the number of frames per second of each stream showed that synchronization is handled automatically by Kinect. The second consideration is the registration between the color and depth images. In Kinect the color image has a resolution of 640×480 and the depth image a resolution of 320×240, so the two images must be brought to a common resolution. In addition, the positions of the RGB camera lens and the depth camera lens are not exactly the same. To solve these problems, we use functions supported by OpenNI. After the initialization process, the hand detection and tracking algorithm begins. The depth image is composed of eight-bit gray values. An object near the camera has a value close to zero, an object farther from the camera has a value approaching 255, and a range that cannot be measured is treated as zero. The pixel value of the depth image therefore increases from 0 to 255 with the increasing distance of the object from the sensor. Figure 3 shows the color image and the depth image from the camera.
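
The resolution difference between the two images can be illustrated with a minimal sketch. It assumes that the lens offset has already been compensated (the paper relies on OpenNI for this), so only the factor between 320×240 and 640×480 remains. The function names `depth_to_color_coords` and `upsample_mask` are hypothetical and are not part of OpenNI.

```python
def depth_to_color_coords(x_d, y_d, depth_size=(320, 240), color_size=(640, 480)):
    """Map a depth-image pixel to the color image by pure scaling.

    Assumption: the two views are already registered, so the only
    remaining difference is the image resolution.
    """
    sx = color_size[0] / depth_size[0]
    sy = color_size[1] / depth_size[1]
    return int(x_d * sx), int(y_d * sy)


def upsample_mask(mask, factor=2):
    """Nearest-neighbour upsampling of a binary mask given as a list of rows."""
    out = []
    for row in mask:
        # Repeat every pixel `factor` times horizontally ...
        wide = [v for v in row for _ in range(factor)]
        # ... and every row `factor` times vertically.
        out.extend(list(wide) for _ in range(factor))
    return out
```

With these defaults, a region mask computed on the depth image can be enlarged by a factor of two before it is overlaid on the color image.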

Figure 3. The high resolution color image and low resolution depth image generated by Kinect

The criterion function detects hand regions as follows. If x, y and z are three consecutive candidate points, with $|x - y| = p$ and $|y - z| = q$, and if $p < q$, then the slope $m$ is computed by equation (1).

From the depth image, we calculate the histogram according to distance and then eliminate unnecessary noise regions from the histogram before analyzing it. Histogram entries with a count of less than ten are eliminated, and the histogram is smoothed to make the analysis more convenient. Before analyzing the histogram and generating the threshold that separates hand regions from the background, we assume that the hand is in front of the body and that the user extends his/her hand while standing at a fixed position. We then analyze the preprocessed histogram to find the area that corresponds to hand regions, that is, an appropriate threshold value that separates the hand region from the background. From the accumulated histogram, we can determine the distance from the camera to the hand, body and background, and also the size of each object. In general, the accumulated histogram increases rapidly at those regions (hand, body and background). Figure 4 illustrates the accumulated histogram calculated from the depth image, in which three slopes can be recognized. Because the hand is very small and in front of the body, it covers less area than the body. Deeper into the scene the accumulated values grow higher because more area is covered by the arm and then the body. Finally, the background produces a steep slope and a very high accumulation in the histogram, marked by the third arrow in the figure.
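
The histogram preprocessing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the depth image is a list of rows of 8-bit values, the value 0 (unmeasured range) is skipped, and the smoothing is a simple moving average whose window size of 5 is chosen arbitrarily. Only the cut-off of ten comes from the text.

```python
def depth_histogram(depth, bins=256):
    """Count how many pixels fall on each 8-bit depth value."""
    hist = [0] * bins
    for row in depth:
        for v in row:
            if v > 0:  # assumption: 0 marks an unmeasured range and is ignored
                hist[v] += 1
    return hist


def preprocess(hist, min_count=10, window=5):
    """Drop entries below min_count, then smooth with a moving average."""
    cleaned = [c if c >= min_count else 0 for c in hist]
    half = window // 2
    smoothed = []
    for i in range(len(cleaned)):
        lo, hi = max(0, i - half), min(len(cleaned), i + half + 1)
        smoothed.append(sum(cleaned[lo:hi]) / (hi - lo))
    return smoothed


def accumulate(hist):
    """Running sum of the histogram (the accumulated histogram)."""
    total, out = 0, []
    for c in hist:
        total += c
        out.append(total)
    return out
```

Calling `accumulate(preprocess(depth_histogram(depth)))` yields a curve of the kind in which the hand, body and background appear as successive rises.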

$$m = \frac{z_2 - x_2}{z_1 - x_1} \qquad (1)$$

where $z_1 \neq x_1$, and $(x_1, x_2)$ and $(z_1, z_2)$ are the coordinates of the points x and z. When $m > Th_{xz}$, the region is selected as a candidate region for the threshold. In our experiments we choose y as the deciding threshold point. Unwanted noise portions that are selected are then filtered out by considering shape or area information. Figure 5 shows the regions (in yellow) selected after computing equation (1) and filtering the noise. The figure also shows that unwanted regions are still selected after thresholding the depth image.
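
A minimal sketch of the slope test on the accumulated histogram is given below. Several details are assumptions rather than the paper's specification: each candidate point is taken as a pair (depth value, accumulated count), the differences p and q are measured on the accumulated counts, the first triple that passes the test is accepted, and the values of `interval` and `slope_threshold` are placeholders.

```python
def find_threshold(cum_hist, interval=8, slope_threshold=50.0):
    """Return the depth value of the middle point y of the first triple
    (x, y, z) of candidate points that passes the slope test, or None."""
    # Candidate points sampled at a fixed interval: (depth value, accumulated count).
    points = [(d, cum_hist[d]) for d in range(0, len(cum_hist), interval)]
    for x, y, z in zip(points, points[1:], points[2:]):
        p = abs(y[1] - x[1])  # difference between x and y
        q = abs(z[1] - y[1])  # difference between y and z
        if p < q:  # two increasing differences in a row
            m = (z[1] - x[1]) / (z[0] - x[0])  # slope between x and z
            if m > slope_threshold:
                return y[0]
    return None
```

Because the candidate points are equally spaced, `z[0] - x[0]` is always twice the interval, so the division is safe.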

Figure 5. A depth image where candidate regions are selected in yellow color

3.2 Measure Skin Color from Candidate Regions

Since objects surrounding the user may be at a similar distance to the user's hand, unwanted portions are also selected as hand regions; all selected regions are called candidate hand regions. We select the hand region from the candidates by following the steps below:

Figure 4. An accumulated and modified histogram from the depth image

To generate a criterion function, we choose some candidate points, selected at a fixed interval along the histogram. We then calculate the difference between each point and the following candidate point. If two successive differences are increasing, we calculate the first order derivative. If the slope is greater than a certain value, the region is selected as a candidate region for differentiating the hand from the body and background. Equation (1) is the criterion function, or threshold, that detects hand regions.
I. Accumulate shape information of the regions extracted from the depth image.
II. Measure the color similarity of the candidate region with the skin from the color image.
III. Rank the candidates based on the color and shape similarity.
IV. Select the best candidate region as a hand region.
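
The ranking in steps III and IV can be illustrated with a short sketch. The paper does not state how color and shape similarity are combined, so the weighted sum below is an assumption, as are the function name `rank_candidates`, the equal weights and the similarity scores in [0, 1].

```python
def rank_candidates(candidates, w_color=0.5, w_shape=0.5):
    """Rank candidate regions by a weighted sum of color and shape similarity.

    candidates: list of dicts with the keys 'name', 'color_sim' and 'shape_sim'.
    Returns (name, score) pairs, best candidate first.
    """
    scored = [
        (c["name"], w_color * c["color_sim"] + w_shape * c["shape_sim"])
        for c in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The first element of the returned list then plays the role of the best candidate selected in step IV.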
To get the shape information, we map the regions detected in the depth image onto the color image and extract their shape. We then measure the color similarity. Since the human hand is skin colored, we use a skin detection algorithm, of which many well-known examples exist. In this paper, we use the Bayesian skin color detection algorithm of [8], one of the most popular skin detection algorithms, which is accurate and fast. It uses skin and non-skin color models to design a skin pixel classifier with an equal error rate of 88%, which is surprisingly good performance given the unconstrained nature of Web images. Visualization studies demonstrate the separation between skin and non-skin color distributions that makes this performance possible. Using this skin classifier, which operates on the color of a single pixel, the authors of [8] construct a system for detecting images containing naked people. This second classifier is based on simple aggregate properties of the skin pixel classifier output. The naked people detector compares favorably to the systems of Forsyth et al. [9] and Wang et al. [10], which are based on complex image features. Because it is based on pixel-wise classification, the detector is extremely fast. These experiments suggest that skin color can be a more powerful cue for detecting people in unconstrained imagery than was previously suspected. Figure 6 illustrates the standard likelihood ratio approach to skin classification, and equation (2) is the skin classifier used in our method.

$$P_{GP}(\text{skin} \mid rgb) = \frac{P_{hist}(rgb \mid \text{skin})}{P_{hist}(rgb \mid \neg\text{skin})} \qquad (2)$$

3.3 Tracing of Hand Contour

Contour tracing is a procedure that combines a number of entities to form a closed polyline consisting of individual segments. Since the depth image has low resolution, it is difficult to extract the boundary of a hand distinctly from it. A more accurate hand contour can be extracted from the color image, guided by the low resolution contour determined from the depth image. We overlay the low resolution candidate contour on the color image and define the search area for the contour tracing. We then trace the hand contour within the search area using the skin color information. Contour tracing [11] incorporates the extraction of edge lines from the image. In our algorithm, we do not need to search the whole image for the region of interest: the search area is defined from the depth image in the previous steps, and only this area is searched using the skin color classification criterion. Because the search regions of the high resolution image are defined from the candidate region of the low resolution depth image, our algorithm tracks hands faster. It also adds confidence to the detection process and achieves higher accuracy. Figure 8 shows the hand contour selection process.
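
The idea of restricting the contour search to an area derived from the depth candidate can be sketched as follows. This is a simplified illustration, not the contour tracing of [11]: the search area is assumed to be a square dilation of the candidate mask, the skin mask is assumed to be given as a binary image of the same size, and the contour is taken as the set of region pixels that touch a non-region pixel. All function names are hypothetical.

```python
def dilate(mask, radius=1):
    """Grow a binary mask by `radius` pixels to obtain a search area."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def refine_with_skin(search, skin):
    """Keep only the skin pixels that lie inside the search area."""
    return [[s & k for s, k in zip(srow, krow)] for srow, krow in zip(search, skin)]


def boundary_pixels(mask):
    """Return (x, y) of region pixels with a 4-neighbour outside the region."""
    h, w = len(mask), len(mask[0])
    pts = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if ny < 0 or ny >= h or nx < 0 or nx >= w or not mask[ny][nx]:
                    pts.append((x, y))
                    break
    return pts
```

Only pixels inside the dilated candidate are examined, which is the source of the speed-up described in the text.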

[Figure 6: plot of the distributions P(rgb | skin) and P(rgb | non-skin) over rgb, with the skin and non-skin regions marked]

(a) Search area defined by depth image

Figure 6. A skin classifier derived from the standard likelihood ratio approach

In equation (2), rgb is the color, $P_{GP}$ is the Gaussian probability that rgb is a skin color, and $P_{hist}(rgb \mid \text{skin})$ and $P_{hist}(rgb \mid \neg\text{skin})$ are the histogram-based probabilities that rgb belongs to the skin and non-skin classes respectively. After measuring the color similarity with skin detection, we rank each candidate region by color and shape similarity and choose the most probable region. The result of this process is shown in Figure 7. Hand regions become more specific and accurate when color information is used.
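
A histogram-based likelihood ratio test of the kind described above can be sketched as follows. It is a minimal illustration with assumed details: colors are quantized into 32 bins per channel, the class models are built from labelled sample pixels, a small constant avoids division by zero, and a pixel is called skin when the ratio reaches a threshold `theta`. None of these values is taken from the paper.

```python
from collections import Counter

BINS = 32  # assumption: 32 bins per color channel


def quantize(rgb):
    """Map an (r, g, b) triple of 8-bit values to its histogram bin."""
    step = 256 // BINS
    return tuple(c // step for c in rgb)


def build_model(samples):
    """Build a color histogram (bin counts and total) from sample pixels."""
    counts = Counter(quantize(s) for s in samples)
    return counts, sum(counts.values())


def likelihood_ratio(rgb, skin_model, nonskin_model, eps=1e-6):
    """Ratio of P_hist(rgb | skin) to P_hist(rgb | non-skin)."""
    key = quantize(rgb)
    p_skin = skin_model[0][key] / skin_model[1]
    p_non = nonskin_model[0][key] / nonskin_model[1]
    return p_skin / (p_non + eps)


def is_skin(rgb, skin_model, nonskin_model, theta=1.0):
    """Classify a single pixel by thresholding the likelihood ratio."""
    return likelihood_ratio(rgb, skin_model, nonskin_model) >= theta
```

The fraction of pixels in a candidate region for which `is_skin` holds could then serve as that region's color similarity.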

Figure 7. (a) Mapping depth image to color image and (b) Best candidate
region is selected after skin detection and ranking

(b) Contour selection from color image

Figure 8. Tracing of Hand Contour

3.4 Tracking Based on Generalized Hough Transform

In the tracking step, we use the GHT-based moving object tracking algorithm suggested in [12]. Briefly, it tracks moving objects robustly by generating a reference pattern and updating the pattern during the matching step, which minimizes the effect of background pixels. The target matching scheme is based on the Generalized Hough Transform (GHT) [13]. It tolerates edge distortion and can find a match from partial information in a relatively short time. Among the various feature-based methods, it uses edge information, which requires relatively little computation compared with other features. It also calculates a weight for each edge pixel that indicates how persistently the pixel has existed over time; the weight is used in generating the reference pattern and matching the target, so that missing or obscured edge pixels can be compensated. Based on the GHT, the algorithm can make efficient use of edge information that is partially missing because of noise. The algorithm is generally used in devices with limited computing power, such as mobile devices, digital cameras and smart phones, for tracking subjects continuously in real time. Figure 9 illustrates the overview of the GHT-based tracking algorithm.
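
The matching principle of the GHT [13] can be illustrated with a translation-only sketch. It covers only the reference table and the voting step; the edge weights, table update and search region estimation of the tracker in [12] are not modelled. Edge points are assumed to be given as (x, y, angle_bin) triples with an already quantized gradient direction, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_r_table(edges, reference):
    """Store, per gradient-direction bin, the offsets from edge points
    to the reference point of the template."""
    table = defaultdict(list)
    rx, ry = reference
    for x, y, angle_bin in edges:
        table[angle_bin].append((rx - x, ry - y))
    return table


def vote(edges, table):
    """Let every edge point vote for the reference positions it supports."""
    acc = Counter()
    for x, y, angle_bin in edges:
        for dx, dy in table.get(angle_bin, ()):
            acc[(x + dx, y + dy)] += 1
    return acc


def locate(edges, table):
    """Return (position, votes) of the accumulator peak, or None."""
    acc = vote(edges, table)
    return acc.most_common(1)[0] if acc else None
```

Because each edge point votes independently, the peak survives when some template edges are missing, which is the property the text relies on for partially obscured hands.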

[Figure 9: block diagram with the blocks Input Image, Detected Image, Reference Table Initialization, Reference Table Update, GHT Based Candidate Selection, Confidence Computation, and Search Region Estimation]

Figure 9. Overview of GHT based tracking algorithm

From the experiments, if the hand is the object closest to the camera, the proposed method can detect the hand in any situation, under the assumption that the hand is in front of the body.

In this paper, edge segment based tracking is used to track the hand in real time. Figure 10 shows the result of tracking the hand using the Generalized Hough Transform.

(a) Hand detection according to different distance

Figure 10. Tracking the hand using GHT

Experimental Results

To set up the experimental environment, we use Microsoft Kinect, which is capable of 3D modeling and generates a depth image and a color image at the same time. An OpenNI function is used to align the depth image coordinates with the color image coordinates for mapping. The experiments are organized in two parts. In the first part, we verify our hand detection approach for different hand distances and directions, compute its accuracy, and compare our result with other popular hand detection algorithms. In the second part, we evaluate the hand tracking approach based on edge segments.

4.1 Experiment on Hand Detection

(b) Hand detection according to different movement

Figure 11. Results of hand detection in different situations

To compute the accuracy of our experiment, we measure the ratio of the number of detected pixels to the number of original hand pixels. We compute this accuracy for the different steps of our algorithm; the results are shown in Figure 12.

Figure 11 shows the result of hand detection. To test the proposed algorithm, we stand in front of the camera, move back and forth, and rotate the hand in arbitrary directions. Figure 11 (a) verifies that the hand is detected well even when the distance changes, because the threshold value is obtained adaptively. In addition, Figure 11 (b) shows that our method can detect the hand even if the user rotates his/her hand in any direction.

Figure 12. Detection rate on normal sequence

To compare our algorithm with other approaches, we use the basic measures of detection quality, namely recall, precision and accuracy; recall and precision are given in equations (3) and (4). Recall quantifies what proportion of the correct entities (number of pixels) is detected, while precision quantifies what proportion of the detected entities is correct. Accuracy reflects the temporal correctness of the detected results. We denote by $P$ the number of pixel positions correctly detected by the algorithm, by $P_M$ the number of missed detections (the pixels that should have been detected but were not), and by $P_F$ the number of false detections (the positions that should not have been detected but were). Table 1 shows the comparison of our method with that of Van den Bergh et al. [4], based on precision and recall. The result shows that our method determines the hand region more accurately than the other method.
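
The pixel-level recall and precision described here can be computed with a short sketch. It assumes that the detected region and the ground-truth hand region are available as sets of (x, y) pixel positions; the function name `detection_scores` is hypothetical.

```python
def detection_scores(detected, truth):
    """Return (recall, precision) for two sets of (x, y) pixel positions."""
    p = len(detected & truth)    # correctly detected pixels
    pm = len(truth - detected)   # missed detections
    pf = len(detected - truth)   # false detections
    recall = p / (p + pm) if p + pm else 0.0
    precision = p / (p + pf) if p + pf else 0.0
    return recall, precision
```

For example, with four ground-truth pixels of which three are detected, plus two false detections, the recall is 3/4 and the precision is 3/5.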

$$\text{Recall} = \frac{P}{P + P_M} \qquad (3)$$

$$\text{Precision} = \frac{P}{P + P_F} \qquad (4)$$

Table 1. Comparison with different approaches

Approaches               Recall    Precision
RGB + ToF (ts=20) [4]    80.05%    82.36%
RGB + ToF (ts=15) [4]    74.32%    78.76%
Proposed Method          81.06%    86.42%

In the table we compare our result with another method for different values of ts; for different values of the static threshold, that method yields different accuracy. Our algorithm, in contrast, is adaptive: it needs no manual fixing of parameters and also achieves higher accuracy than the other method.

4.2 Experiment on Hand Tracking

Figure 13 shows the result of hand tracking, in which edge segments are tracked based on the GHT. The hands are tracked well, and the tracking is in general fast and accurate. Since the features of the hand are very small, it is sometimes difficult to track the hand, but our tracking process performs well in both accuracy and speed.

[Figure 13: video frames No. 132, 136, 148, 152, 166 and 178]

Figure 13. Hand tracking result for different video frames

More accurate hand contours are extracted from the color image. We overlay the low resolution candidate contour on the color image and define the search area for the contour tracing. In our algorithm, the search area is restricted using the area information gathered during the hand detection process: the search regions of the high resolution image are defined from the candidate region of the low resolution depth image, and only these regions are searched using the skin color classification criterion. Because we have to concentrate only on a small region, our hand tracking becomes faster and more accurate than conventional algorithms. Experimental results show that our tracking is two to four times faster than conventional hand tracking algorithms.

Conclusions

We proposed a novel method that uses Kinect to extract hand features quickly and accurately from color and depth images at the same time for real-time tracking. The proposed method analyzes the histogram of the depth image to find an appropriate threshold value for extracting the hand region, and uses the information of the color image to improve the accuracy of hand detection and overcome the low resolution of the depth image. Even though the proposed method works under a restricted environment, it detects the hand directly, without searching for the face or body as the conventional methods do. Owing to its low complexity and speed, our algorithm can be used as a new interface, like the keyboard and mouse. Relaxing the restrictions on the environment while detecting and tracking the hand is the next research issue for this approach.

References

[1] A. A. Argyros and M. I. A. Lourakis, "Binocular hand tracking and reconstruction based on 2D shape matching", In Proc. International Conference on Pattern Recognition (ICPR), Hong Kong, China, 2006.
[2] M. Van den Bergh, F. Bosch, E. Koller-Meier and L. Van Gool, "Haarlet-based hand gesture recognition for 3D interaction", Workshop on Applications of Computer Vision (WACV), pp.1-8, December 2009.
[3] S. I. Kang, A. Roh and H. Hong, "Using depth and skin color for hand gesture classification", 2011 IEEE International Conference on Consumer Electronics (ICCE), pp.155-156, January 2011.
[4] M. Van den Bergh and L. Van Gool, "Combining RGB and ToF Cameras for Real-time 3D Hand Gesture Interaction", 2011 IEEE Workshop on Application of Computer Vision (WACV), pp.66-72, January 2011.
[5] R. R. Igorevich, P. Park, D. Min, Y. Park, J. Choi and E. Choi, "Hand gesture recognition algorithm based on grayscale histogram of the image", 4th International Conference on Application of Information and Communication Technologies (AICT), pp.1-4, October 2010.
[6] M. Van den Bergh, D. Carton, R. De Nijs, N. Mitsou, C. Landsiedel, K. Kuehnlenz, D. Wollherr, L. Van Gool and M. Buss, "Real-time 3D Hand Gesture Interaction with a Robot for Understanding Directions from Humans", RO-MAN, 2011 IEEE, pp.357-362, July 31-August 3 2011.
[7] P. Doliotis, A. Stefan, C. McMurrough, D. Eckhard and V. Athitsos, "Comparing gesture recognition accuracy using color and depth information", PETRA '11 Proceedings of the 4th International Conference on PErvasive Technologies Related to Assistive Environments, Article No. 20, NY, USA, 2011.
[8] M. J. Jones and J. M. Rehg, "Statistical Color Models with Application to Skin Detection", IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol.1, June 1999.
[9] D. A. Forsyth and M. M. Fleck, "Automatic detection of human nudes", International Journal of Computer Vision, 32(1):63-77, August 1999.
[10] J. Z. Wang, J. Li, G. Wiederhold and O. Firschein, "System for screening objectionable images using Daubechies wavelets and color histograms", In Proc. of the International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services, pp.20-30, 1997.
[11] G. Simion, V. Gui and M. Otesteanu, "Finger Detection Based on Hand Contour and Color Information", IEEE International Symposium on Applied Computational Intelligence and Informatics, May 19-21, 2011.
[12] J. Kim and O. Chae, "Moving object tracking using edge segment matching for mobile devices", 23rd KSPC conference, Vol.23 No.1, pp.381, October 2010.
[13] D. H. Ballard, "Generalizing the Hough transform to detect arbitrary shapes", Pattern Recognition, Vol.13, No.2, pp.111-122, 1981.
[14] M. Yokoyama and T. Poggio, "A contour-based moving object detection and tracking", IEEE Intl. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp.271-276, China, Oct. 2005.
[15] J. Canny, "A computational approach to edge detection", IEEE Trans. Pattern Anal. Mach. Intell., Vol.8, No.6, pp.679-698, November 1986.
[16] G. Borgefors, "Hierarchical chamfer matching: A parametric edge matching algorithm", IEEE Trans. Pattern Anal. Mach. Intell., Vol.10, No.6, pp.849-865, Nov. 1988.
[17] L. Xia, C.C. Chen and J. K. Aggarwal, "Human detection using depth information by Kinect", 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.15-22, June 2011.
[18] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman and A. Blake, "Real-time human pose recognition in parts from single depth images", 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.1297-1304, June 2011.
