
CMPUT 498 Computer Vision

Vision-based Mouse Interaction Model


Mark McElhinney
April 26, 2005

Overview:

Computers are an integral and essential part of every industry today. However, there
remains only one primary method of interaction with most computers, namely the mouse.
Although the mouse is practical and efficient in most circumstances, many situations would
be better served by other means, such as touch-sensitive monitors. Many companies have
created just such devices: desktop monitors with touch sensitivity, and large-scale
interactive whiteboards that work in conjunction with projectors (SMART Technologies Inc.,
GTCO, etc.).

Two of the primary set-ups for most interactive monitors are shown below:

Fig 1.1: Rear (left) and front (right) projection SMART Boards


Pictures taken from www.smarttech.com

The technology in both cases limits the degree of interaction, however. In the case of the
front-projection setup, a multi-layered proprietary canvas detects contact when its layers
are pressed together with enough force. This limits the number of reported contacts to
one at a time, and no information about the size of the contact is known. In the case of the
rear-projection DViT (Digital Vision Touch) system, 3 or 4 digital cameras are used to
triangulate the location, size, and number of contacts. However, these cameras are
located in the corners of the screen and look for contact parallel to the screen's surface.
As a result, only a small number (2-3) of contacts can be reported, and fewer in cases
where one contact occludes the location of another.

It would be more useful for some purposes to be able to determine larger numbers of
contacts, the size of individual contacts, and even the shape of contacts. This would allow
a much more diverse interaction model, providing an almost endless degree of flexibility
and use (imagine a whole-hand contact acting as a panning tool, and a single finger acting
as a mouse).
It is the goal of this project to set up the framework and prove the feasibility (or lack of
feasibility) of a touch-based interaction system built on a rear-projection, rear-capture
camera setup.

Hardware:

Like other rear-projection devices, my test rig used a home-made combination of a projector,
mirror, Plexiglas surface, and plywood cabinet to perform my tests. My original thoughts
led me to a design such as the one below:

While building it, however, I realized that a more usable device could be created by
effectively tilting the entire device 25-30 degrees backwards. This allows the cabinet
to sit on the floor while tests are performed, and creates a direct line of sight
perpendicular to the touch surface. Additionally, a projector was added behind the camera
(projecting directly above the camera) to make the set-up complete. The following are
some images of the test set-up:
Although made very inexpensively, the setup produced quite good results. The front
fogged-glass surface used to display the projection was actually a clear sheet of
Plexiglas sanded with 120-grit sandpaper. Other surfaces may have performed better, but
for my purposes this was adequate. I used a basic FireWire webcam for image capture, and
a Videoball projector to project the desktop image onto the mirror and ultimately onto
the screen.

Software:

In order to track contacts on the Plexiglas surface, I had to design a software component
to analyze captured images and interpret them as contacts. To do this, I used Visual C++
on a Windows XP machine. I developed an MFC forms application that displays the raw video
captured from the webcam in one frame, and either the filtered, post-processed image or a
sequence of images in the other frame.

I implemented two separate methods to attempt to isolate regions of the image that
corresponded to screen contacts. First, I created a section where I could capture individual
frames from the input source and incrementally apply image processing techniques. In this
section, I implemented the following image processing algorithms (a minimal sketch of two
of these operations appears after the list):
- thresholding between two values
- 3x3 median filter
- 3x3 mean filter
- pixel removal based on key-stoned image
- conversion to grayscale
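
To make these operations concrete, here is a minimal sketch of two of them in standard
C++. This is illustrative only, not the project's actual MFC code; the flat-array Image
struct and the function names are assumptions made for the example.

#include <algorithm>
#include <vector>

// Minimal 8-bit grayscale image: row-major pixel buffer plus dimensions.
struct Image {
    int width, height;
    std::vector<unsigned char> pixels; // size == width * height
};

// Convert a packed 24-bit RGB buffer to grayscale using a simple average.
Image toGrayscale(const unsigned char* rgb, int width, int height) {
    Image gray{width, height, std::vector<unsigned char>(width * height)};
    for (int i = 0; i < width * height; i++) {
        int sum = rgb[3 * i] + rgb[3 * i + 1] + rgb[3 * i + 2];
        gray.pixels[i] = static_cast<unsigned char>(sum / 3);
    }
    return gray;
}

// 3x3 median filter: each interior pixel is replaced by the median of its
// 3x3 neighbourhood; border pixels are left untouched.
Image medianFilter3x3(const Image& src) {
    Image dst = src;
    for (int y = 1; y < src.height - 1; y++) {
        for (int x = 1; x < src.width - 1; x++) {
            unsigned char window[9];
            int k = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    window[k++] = src.pixels[(y + dy) * src.width + (x + dx)];
            std::nth_element(window, window + 4, window + 9); // median of nine values
            dst.pixels[y * src.width + x] = window[4];
        }
    }
    return dst;
}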

The second and most useful method, seen on the right of the screenshot, performs a
sequence of processing techniques in a set order on one or more images. By choosing
'Process Video', a timer is used to process a frame at set intervals, displaying the X and
Y coordinates of the centroid of the most probable contact on the screen in the labels
below.

As seen on the next page, a contact is detected and a bounding box is created around it.
Additionally, the centroid is reported at (190, 156).

Implementation:

A large variety of methods was considered when trying to determine how to accurately
detect surface contact with only one rear-capture camera. The original idea was to subtract
the known desktop image projected onto the back of the screen from the captured camera
image, and look for locations with large differences. However, this idea does not seem
possible with my hardware implementation, as there is a lot of glare on the back of the
screen from the projection. Additionally, I found the quality of the webcam to be
insufficient (specifically because of the noise it captured) to get an accurate color
sampling of the image on the screen. The image subtraction idea simply did not work.
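
For reference, the abandoned subtraction approach amounts to something like the sketch
below, assuming the desktop image has already been resampled to the camera's resolution
and converted to grayscale. The function and its parameters are illustrative assumptions,
not code from the project.

#include <cstdlib>

// Absolute per-pixel difference between the expected (projected desktop)
// image and the captured camera image, thresholded into a binary mask.
// Both buffers are 8-bit grayscale of the same size.
void diffImages(const unsigned char* expected, const unsigned char* captured,
                unsigned char* mask, int numPixels, int threshold) {
    for (int i = 0; i < numPixels; i++) {
        int diff = std::abs(int(expected[i]) - int(captured[i]));
        mask[i] = (diff > threshold) ? 255 : 0; // white where the images disagree
    }
}

In practice, the glare and camera noise described above made the difference mask far too
unreliable to isolate contacts.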
After this setback, I instead started working with image processing alone on captured
images with the projector on, with it off, and under different lighting conditions. Since I
was working in fairly poor lighting, I found it easiest to turn the lights in my room off
and use only the natural light from the window. This way there was no single dominating
light source, and therefore no large silhouette of everything between the light source and
the screen. With the diffuse light from the window, I was able to detect contacts both
while the projector was on and while it was off.

To do this, I performed the following operations (a minimal sketch of the three steps
chained together appears after the list):

1. Remove pixels that lie outside the screen region, using the homography section of the
   dialog. (Note: I did not actually implement a homography warp; I just removed pixels
   that lay outside the specified region.)
2. Perform a 3x3 mean filter on the image to smooth out noise. (Ideally I would use a
   median filter, but that proved to be too slow.)
3. Perform thresholding on the result, using the values entered in the threshold section
   of the dialog.
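
The following is a minimal sketch of these three steps chained together on an 8-bit
grayscale buffer. The rectangular screen region and all of the names here are illustrative
assumptions, not the project's actual dialog-driven code.

#include <vector>

// Mask to the screen region, smooth with a 3x3 mean filter, then threshold.
void processFrame(std::vector<unsigned char>& img, int width, int height,
                  int left, int top, int right, int bottom,
                  unsigned char lowThresh, unsigned char highThresh) {
    // 1. Zero out pixels that lie outside the specified screen region.
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            if (x < left || x > right || y < top || y > bottom)
                img[y * width + x] = 0;

    // 2. 3x3 mean filter to smooth out noise (border pixels left untouched).
    std::vector<unsigned char> smoothed = img;
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += img[(y + dy) * width + (x + dx)];
            smoothed[y * width + x] = static_cast<unsigned char>(sum / 9);
        }
    }
    img = smoothed;

    // 3. Threshold between the two values: pixels in range become white (255).
    for (unsigned char& p : img)
        p = (p >= lowThresh && p <= highThresh) ? 255 : 0;
}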

These three steps left me with an image that had a variety of different connected regions
of values within the threshold. Ideally, there would be only one region, where the contact
actually was. However, due to noise, at times there was more than one region, or large
regions with intermittent white pixels. I designed and implemented an algorithm to find all
of the connected regions and, once found, define their centroid, width, and height. To do
this, I used a depth-first traversal of each region, using a recursive method on each newly
found white pixel in the image. Here is the pseudocode:

// Assumes a global 'region' object that records pixel coordinates and an Image
// type with width, height, and at(x, y) pixel access; both are simplified
// stand-ins for the actual application code.
for (int i = 0; i < image.width; i++) {
    for (int j = 0; j < image.height; j++) {
        if (image.at(i, j) == 255) {
            region.clear();               // start a fresh region
            processRegion(i, j, image);   // depth-first flood fill from (i, j)
            regions.push_back(region);    // keep the completed region
        }
    }
}

void processRegion(int i, int j, Image& image) {
    if (i < 0 || i >= image.width || j < 0 || j >= image.height)
        return;                           // outside the image
    if (image.at(i, j) != 255)
        return;                           // not white, or already visited
    image.at(i, j) = 128;                 // make it non-255 so it isn't processed twice
    region.add(i, j);

    processRegion(i + 1, j, image);       // process pixel to the right
    processRegion(i - 1, j, image);       // process pixel to the left
    processRegion(i, j + 1, image);       // process pixel below
    processRegion(i, j - 1, image);       // process pixel above
}

After doing this, we have a region for each connected area in the image, and can calculate
the centroid, width, and height of each by finding the minimum and maximum values in the x
and y directions within each region. With this information, a variety of things can be
done. By using the ratio of width to height, we can limit our selected regions to those
that are roughly square. Additionally, by knowing the location of the last contact
(centroid), we could restrict further searches to a larger encapsulating region around it
to make things more efficient (I chose not to do this due to time constraints). We can
also choose to ignore regions that are too small to be an actual contact and are probably
attributable to noise.
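
A sketch of these per-region calculations follows, with an illustrative Region
representation that simply lists the coordinates of each white pixel (not the
application's actual data structure):

#include <algorithm>
#include <vector>

// A found region as a list of pixel coordinates, one entry per white pixel.
struct Region {
    std::vector<int> xs, ys;
};

// Bounding-box width, height, and centroid derived from a region's pixels.
struct RegionStats { int width, height, centroidX, centroidY; };

RegionStats analyzeRegion(const Region& r) {
    // Assumes the region contains at least one pixel.
    int minX = *std::min_element(r.xs.begin(), r.xs.end());
    int maxX = *std::max_element(r.xs.begin(), r.xs.end());
    int minY = *std::min_element(r.ys.begin(), r.ys.end());
    int maxY = *std::max_element(r.ys.begin(), r.ys.end());
    RegionStats s;
    s.width = maxX - minX + 1;
    s.height = maxY - minY + 1;
    s.centroidX = (minX + maxX) / 2; // centre of the bounding box
    s.centroidY = (minY + maxY) / 2;
    return s;
}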

My Contact Classification Criteria

I chose my classification criteria based on the proportion of width to height, and on
absolute size. The height and width of a proposed contact region must each be greater than
3 pixels, and the ratio of width to height must be between 0.5 and 2. Other classification
criteria could easily be added, making the tracking much more robust.
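
Expressed as code, these criteria amount to a simple predicate (a sketch reusing the
illustrative RegionStats struct from the previous example):

// Reject regions that are too small or too far from square to be a
// plausible finger contact.
bool isLikelyContact(const RegionStats& s) {
    if (s.width <= 3 || s.height <= 3)
        return false;                        // too small; probably noise
    double ratio = double(s.width) / double(s.height);
    return ratio >= 0.5 && ratio <= 2.0;     // roughly square region
}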

Results:

Negative

The image processing techniques, as well as the hardware, were quite straightforward for
this project. Despite this, a lot of effort was required to determine the best combination
of filtering, thresholding, and region analysis. As mentioned previously, one of the
reasons for this difficulty was the ever-varying and inconsistent lighting conditions. When
there is only one source of light, or very limited lighting, intensity thresholding can
become very difficult and unpredictable. For my purposes, I found that the only way I could
overcome this was by controlling the environment in which my tests were done, so that I
could accurately find the range of values to be thresholded. More work could be put into
determining what type of light is present (with some sort of luminosity sensor) and
dynamically adjusting the thresholds based on that.

Despite the success I found with both the backlit and non-backlit contact detection, it
remained inconsistent. I found that with my actual Plexiglas surface, rather than the
finger simply blocking light from the front, the camera was able to actually see the
finger's contact on the screen. While it would certainly be possible to use that to my
advantage (by using HSV image analysis and thresholding on some sampled color instead), I
did not have the time to pursue this. Instead, to make things more consistent, I simply
wore a small black glove when performing my experiments.

Additionally, the coordinates returned by my region tracking algorithm are not warped into
actual screen coordinates, as would be necessary for this to be useful in practice. An
orientation procedure could be created to determine four point correspondences; using the
Direct Linear Transformation (DLT) algorithm or the Gold Standard algorithm, a homography
could then be found that warps the original image coordinates into screen coordinates.
Once again, this is something that, although easily implemented in Matlab, proved difficult
in VC++ and was beyond the scope of this project.
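
For reference, with exactly four correspondences the DLT reduces to solving an 8x8 linear
system once the last homography entry is fixed at 1. The following is a minimal standalone
sketch of that calculation (illustrative only; it assumes the four correspondences are
non-degenerate and does no error handling):

#include <array>
#include <cmath>
#include <utility>

// Given four image points (x, y) and the corresponding screen points (u, v),
// solve for the homography entries h11..h32 with h33 fixed at 1.
std::array<double, 9> homographyFromFourPoints(const double x[4], const double y[4],
                                               const double u[4], const double v[4]) {
    double A[8][9] = {}; // augmented system: 8 equations, 8 unknowns + right-hand side
    for (int i = 0; i < 4; i++) {
        double* r1 = A[2 * i];
        double* r2 = A[2 * i + 1];
        // u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1)
        r1[0] = x[i]; r1[1] = y[i]; r1[2] = 1;
        r1[6] = -u[i] * x[i]; r1[7] = -u[i] * y[i]; r1[8] = u[i];
        // v = (h21*x + h22*y + h23) / (h31*x + h32*y + 1)
        r2[3] = x[i]; r2[4] = y[i]; r2[5] = 1;
        r2[6] = -v[i] * x[i]; r2[7] = -v[i] * y[i]; r2[8] = v[i];
    }
    // Gaussian elimination with partial pivoting.
    for (int col = 0; col < 8; col++) {
        int pivot = col;
        for (int row = col + 1; row < 8; row++)
            if (std::fabs(A[row][col]) > std::fabs(A[pivot][col])) pivot = row;
        for (int k = 0; k < 9; k++) std::swap(A[col][k], A[pivot][k]);
        for (int row = col + 1; row < 8; row++) {
            double factor = A[row][col] / A[col][col];
            for (int k = col; k < 9; k++) A[row][k] -= factor * A[col][k];
        }
    }
    // Back substitution.
    double h[8];
    for (int row = 7; row >= 0; row--) {
        double sum = A[row][8];
        for (int k = row + 1; k < 8; k++) sum -= A[row][k] * h[k];
        h[row] = sum / A[row][row];
    }
    return {h[0], h[1], h[2], h[3], h[4], h[5], h[6], h[7], 1.0};
}

Mapping a detected centroid into screen coordinates is then a matter of multiplying by the
homography and dividing by the resulting third homogeneous coordinate.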

Finally, this implementation was quite computationally intensive. Running on a P4 at
2.4 GHz, my application required between 70 and 90% of the CPU while capturing and
analyzing the contacts. Although this could be dramatically reduced if the display and UI
were not present, it is unlikely that this would be practical enough to replace an actual
mouse or other contact device.

Positive

Despite all of the negative comments above, there was still some good to come from this
project. I was able to somewhat accurately locate and identify multiple contacts on the
screen simultaneously in real time. This was done without the use of SSD tracking or any
initialization period or training (albeit in constrained conditions). The contact tracking was
successful both when the rear-projection was active and when it was inactive, showing
that it was unnecessary to use any sort of subtraction between desktop and captured
images as originally thought.

This setup was very inexpensive to develop and could easily be reproduced for additional
tests. In total, it cost only $20 in wood to construct and makes use of a very inexpensive
web camera (~$70).

Although this might not be a directly useful contact detection system on its own, it could
be used quite effectively in combination with other technologies. For example, a system
could easily be created that uses DViT or analog resistive technology as the primary
contact detection, and uses this rear-capture technology to ascertain more specific
properties of the area of contact.

Usage:

After unzipping the CameraMouse.zip file, you can run the software in one of two ways.
Browse to CameraMouse/CameraMouse/Debug and double-click on CameraMouse.exe.
Alternatively, you can open the Visual Studio 2003 solution by opening CameraMouse.sln in
the CameraMouse folder.

Note: the Process Frame and Process Video buttons will cause out-of-memory errors and exit
the application if the captured image is too bright, as the region search algorithm gets
overwhelmed.

Sample Video:

A sample video sequence is also included in the root of the zip file. It is named test.rm
and can be played using RealPlayer.
