People Tracking using a Depth Camera


People tracking using blob.



Depth cameras are increasingly popular in building automation, occupancy management, security and access control, enabled by low-cost depth sensors like the OPT8241 and OPT8320 3D Time-of-Flight chipsets from Texas Instruments. A key benefit of depth cameras is the ability to use depth to separate the foreground from the background. Once foreground objects are isolated, they can be recognized, tracked and counted using modern image processing algorithms available in OpenCV. In this post, I will describe how to use the OPT8241-CDK-EVM depth camera, the BSD-licensed Voxel SDK, and OpenCV to create a simple people counting and tracking application.

The general strategy of people counting and tracking is as follows:

  1. Foreground-Background Separation
  2. Convert to Binary Image and Apply Morphology Filters
  3. Shape Analysis
  4. Tracking

Foreground-Background Separation

Foreground-background separation starts with registering the background, which is necessary before one can separate foreground from background through image subtraction; with a depth camera, the subtraction is between two depth images. Setting the background could be as simple as capturing a frame when the scene contains no foreground objects. But the simple approach also means that background objects that subsequently move will be detected as foreground, though noticing the initial change would be interesting to some applications. A more sophisticated approach is to slowly fade any alteration into the background, provided the alteration is not from objects being tracked and is no longer changing. The first condition requires recognition; the second is met when the altered areas of the image show no change for a sustained period. If the sophisticated approach is adopted, the foreground is defined as the fast-changing component of the scene, and the background as the slow-changing component. The rate at which the foreground fades into the background should be a programmable parameter that depends on the type of application. After subtraction, the remaining differences come from objects that have newly appeared or disappeared. To reduce the impact of camera noise, the “foreground” may need to be further qualified by a minimum delta depth (“thickness”) and a minimum amplitude (“brightness”).

The code example below illustrates a simple case of foreground-background separation:

void Horus::clipBackground(Mat &dMat, Mat &iMat, float dThr, float iThr) {
   for (int i = 0; i < dMat.rows; i++) {
      for (int j = 0; j < dMat.cols; j++) {
         float val = (iMat.at<float>(i,j) > iThr && dMat.at<float>(i,j) > dThr) ? 255.0 : 0.0;
         dMat.at<float>(i,j) = val;
      }
   }
}

where iThr is the intensity threshold, and dThr is the depth threshold.
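
The slow fade-in described earlier can be approximated with a running average of the background. The following is only a minimal sketch in plain C++ (no OpenCV or Voxel SDK), and the function name `separateForeground` and the parameter `fadeRate` are hypothetical. It assumes row-major images of equal size and blends every pixel toward the current depth, so it does not check whether a pixel belongs to a tracked person; a real system would probably add that check.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the Voxel SDK or OpenCV.
// Returns a mask (255 = foreground, 0 = background) and then lets the
// background drift toward the current depth at the given fade rate.
std::vector<unsigned char> separateForeground(
    const std::vector<float> &depth,
    const std::vector<float> &amplitude,
    std::vector<float> &background,
    float minDeltaDepth,   // minimum "thickness"
    float minAmplitude,    // minimum "brightness"
    float fadeRate)        // 0 = static background, 1 = replace at once
{
    assert(depth.size() == amplitude.size());
    assert(depth.size() == background.size());

    std::vector<unsigned char> mask(depth.size(), 0);
    for (std::size_t k = 0; k < depth.size(); ++k) {
        // Foreground must differ enough in depth and be bright enough.
        const float delta = std::fabs(background[k] - depth[k]);
        if (delta > minDeltaDepth && amplitude[k] > minAmplitude) {
            mask[k] = 255;
        }
        // Exponential moving average: changes that persist fade into
        // the background over time.
        background[k] += fadeRate * (depth[k] - background[k]);
    }
    return mask;
}
```

A small `fadeRate` keeps people in the foreground for a long time, while a larger value absorbs moved furniture sooner; the right value depends on the application.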

Binary Image and Morphology Filter

The foreground from subtraction may contain speckles due to noise, as noise varies from frame to frame. Morphology operators can be applied to remove speckles and fill in small gaps. The open operator first erodes the image using the chosen structuring element, then dilates the result to fill in the gaps and smooth the edges. The OpenCV example is given below, where the image on the left is the original image, and the image on the right is the result after applying the open operator. Note that the small holes and gaps are filled in.
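
To make the erode-then-dilate order concrete, here is a minimal sketch of a morphological open in plain C++, assuming a binary image stored as nested vectors and a fixed 3x3 square structuring element. The names `Image`, `morph3x3` and `morphOpen` are hypothetical; in practice OpenCV's `morphologyEx` would be used instead.

```cpp
#include <cassert>
#include <vector>

using Image = std::vector<std::vector<unsigned char>>;  // 0 or 255 per pixel

// Apply a 3x3 square structuring element.
// erode == true : a pixel stays set only if every in-bounds neighbor is set.
// erode == false: (dilate) a pixel is set if any in-bounds neighbor is set.
Image morph3x3(const Image &src, bool erode)
{
    assert(!src.empty());
    const int rows = static_cast<int>(src.size());
    const int cols = static_cast<int>(src[0].size());
    Image dst(rows, std::vector<unsigned char>(cols, 0));
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            bool all = true, any = false;
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int rr = r + dr, cc = c + dc;
                    if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) continue;
                    if (src[rr][cc]) any = true; else all = false;
                }
            }
            dst[r][c] = (erode ? all : any) ? 255 : 0;
        }
    }
    return dst;
}

// Open = erode, then dilate.
Image morphOpen(const Image &src)
{
    return morph3x3(morph3x3(src, true), false);
}
```

With this element, an isolated single-pixel speckle does not survive the erosion, while a 3x3 or larger region is restored by the dilation.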

Shape Analysis

After the foreground is isolated as a binary image, shape analysis can be performed to find individual objects in the foreground. This step is where people counting solutions vary: algorithms that differentiate people from objects with high accuracy are considered superior to those that do not. People tracking algorithms depend heavily on camera angles. Algorithms for ceiling-mounted cameras are generally simpler than those for corner-mounted cameras because from the ceiling “people” look like well-formed blobs; but from the corner, “people” become complex overlapping silhouettes which are harder to separate. Several shape analysis algorithms useful in people tracking and counting are described below. Most of them are available in OpenCV.

Blob Analysis

Blob analysis works by grouping connected foreground pixels into self-enclosed regions and filtering those regions by properties such as area, threshold, circularity, inertia and convexity. Proper selection of these properties can greatly enhance accuracy. A great summary article on blob analysis with example code is available from Satya Mallick.


Blob analysis works best when the camera is ceiling mounted, because people will generally look like well-formed blobs from that camera angle. However, people in physical contact with one another can cause their blobs to join, leading to miscounts. The erode operator is useful in this case, as it can split thinly connected blobs. Even though blob analysis is a natural fit for ceiling-mounted cameras, it can be applied to corner-mounted cameras if the overlapping issue can be resolved. One way to deal with overlap is to “slice” the observed volume along the camera’s z-axis and perform blob analysis one “slice” at a time.
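
As a rough illustration of counting blobs with an area filter, the sketch below labels 8-connected regions with a flood fill and counts those whose pixel area falls inside a range. It is a swapped-in technique in plain C++, not the algorithm used by OpenCV's SimpleBlobDetector, and `countBlobs`, `minArea` and `maxArea` are hypothetical names chosen to mirror the area filter idea.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Image = std::vector<std::vector<unsigned char>>;  // non-zero = foreground

// Count 8-connected foreground regions whose area is in [minArea, maxArea].
int countBlobs(const Image &binary, int minArea, int maxArea)
{
    assert(!binary.empty());
    const int rows = static_cast<int>(binary.size());
    const int cols = static_cast<int>(binary[0].size());
    std::vector<std::vector<bool>> visited(rows, std::vector<bool>(cols, false));
    int count = 0;

    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            if (!binary[r][c] || visited[r][c]) continue;

            // Flood fill one region and measure its area in pixels.
            int area = 0;
            std::vector<std::pair<int, int>> stack;
            stack.push_back(std::make_pair(r, c));
            visited[r][c] = true;
            while (!stack.empty()) {
                const std::pair<int, int> p = stack.back();
                stack.pop_back();
                ++area;
                for (int dr = -1; dr <= 1; ++dr) {
                    for (int dc = -1; dc <= 1; ++dc) {
                        const int rr = p.first + dr, cc = p.second + dc;
                        if (rr < 0 || rr >= rows || cc < 0 || cc >= cols) continue;
                        if (binary[rr][cc] && !visited[rr][cc]) {
                            visited[rr][cc] = true;
                            stack.push_back(std::make_pair(rr, cc));
                        }
                    }
                }
            }
            if (area >= minArea && area <= maxArea) ++count;
        }
    }
    return count;
}
```

Two people whose blobs touch would be counted as one region here, which is why eroding the binary image first, as noted above, can help.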

Contour Analysis

Foreground shapes can also be recognized and tracked by contours. A contour is a list of points that forms a closed outline of a foreground object. A contour has a length and an enclosed area. A point in the image can be inside or outside a contour, and a contour can be nested inside another, but contours do not cross paths. Contours can be compared for similarity. When these properties are set to reflect those of a “person”, the number of matching contours in the foreground becomes a people counter.
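
The contour properties mentioned above (length, enclosed area, inside/outside test) are simple to compute once the contour is available as an ordered list of points. The following plain C++ sketch uses the shoelace formula and ray casting; `Pt`, `polygonArea`, `polygonPerimeter` and `contains` are hypothetical names, and OpenCV offers its own equivalents.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// Enclosed area of a closed contour (shoelace formula).
double polygonArea(const std::vector<Pt> &c)
{
    assert(c.size() >= 3);
    double twice = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        const Pt &a = c[i];
        const Pt &b = c[(i + 1) % c.size()];
        twice += a.x * b.y - b.x * a.y;
    }
    return std::fabs(twice) / 2.0;
}

// Length of the closed contour.
double polygonPerimeter(const std::vector<Pt> &c)
{
    assert(c.size() >= 2);
    double length = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        const Pt &a = c[i];
        const Pt &b = c[(i + 1) % c.size()];
        length += std::hypot(b.x - a.x, b.y - a.y);
    }
    return length;
}

// Ray casting: true if p lies strictly inside the contour.
bool contains(const std::vector<Pt> &c, const Pt &p)
{
    bool inside = false;
    for (std::size_t i = 0, j = c.size() - 1; i < c.size(); j = i++) {
        const bool crosses = (c[i].y > p.y) != (c[j].y > p.y);
        if (crosses &&
            p.x < (c[j].x - c[i].x) * (p.y - c[i].y) / (c[j].y - c[i].y) + c[i].x) {
            inside = !inside;
        }
    }
    return inside;
}
```

A person filter could then, for example, accept only contours whose area falls in an expected range for the camera height.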

A key benefit of contours is the ability to identify appendages, or body parts, such as fingers, legs, arms and shoulders. This ability is available through contour operators like convex hull and convexity defects. In the example below, the convex hull is the set of vertices of the green convex polygon, and the convexity defects are the red points at the bottom of the “valleys”. The “valleys” are called convexity defects because they represent violations of convexity. Once the convex hull and convexity defects are identified, they can be combined with the contour centroid and some heuristics to identify the head, arms and legs of a person.
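
The hull-and-defect idea can be sketched without OpenCV. The code below computes the convex hull with Andrew's monotone chain and then, for each pair of neighboring hull vertices, reports the contour point lying deepest in the “valley” between them. It assumes a simple (non-self-intersecting) contour with distinct points; `Pt`, `Defect`, `convexHullIndices` and `convexityDefects` are hypothetical names, and this is not OpenCV's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };
struct Defect { std::size_t index; double depth; };  // contour index, distance to hull edge

static double cross(const Pt &o, const Pt &a, const Pt &b)
{
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

// Andrew's monotone chain. Returns hull vertex indices in contour order.
std::vector<std::size_t> convexHullIndices(const std::vector<Pt> &pts)
{
    assert(pts.size() >= 3);
    std::vector<std::size_t> order(pts.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&pts](std::size_t a, std::size_t b) {
        return pts[a].x < pts[b].x || (pts[a].x == pts[b].x && pts[a].y < pts[b].y);
    });

    std::vector<std::size_t> hull;
    for (int pass = 0; pass < 2; ++pass) {          // lower chain, then upper chain
        const std::size_t start = hull.size();
        for (std::size_t idx : order) {
            while (hull.size() >= start + 2 &&
                   cross(pts[hull[hull.size() - 2]], pts[hull.back()], pts[idx]) <= 0) {
                hull.pop_back();
            }
            hull.push_back(idx);
        }
        hull.pop_back();                            // end point starts the next chain
        std::reverse(order.begin(), order.end());
    }
    std::sort(hull.begin(), hull.end());
    return hull;
}

// For each hull edge, keep the deepest contour point if deeper than minDepth.
std::vector<Defect> convexityDefects(const std::vector<Pt> &contour, double minDepth)
{
    const std::vector<std::size_t> hull = convexHullIndices(contour);
    const std::size_t n = contour.size();
    std::vector<Defect> defects;
    for (std::size_t k = 0; k < hull.size(); ++k) {
        const std::size_t a = hull[k];
        const std::size_t b = hull[(k + 1) % hull.size()];
        const Pt &pa = contour[a];
        const Pt &pb = contour[b];
        const double len = std::hypot(pb.x - pa.x, pb.y - pa.y);
        Defect deepest{a, 0.0};
        for (std::size_t i = (a + 1) % n; i != b; i = (i + 1) % n) {
            const double d = std::fabs(cross(pa, pb, contour[i])) / len;
            if (d > deepest.depth) deepest = Defect{i, d};
        }
        if (deepest.depth > minDepth) defects.push_back(deepest);
    }
    return defects;
}
```

The `minDepth` threshold plays the role of a heuristic: shallow dents from noise are ignored, while deep valleys, such as the gap between an arm and the torso, are kept.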


Region Growing

For corner-mounted cameras, people in the foreground may overlap, especially in a crowded room.  The point cloud of the foreground pixels should be exploited to group points belonging to the same individual. The region growing algorithm can be applied to group pixels having similar z \cos(\theta) distance from the camera, where \theta is the camera pitch angle.

The first step is finding suitable seed points. One way is to histogram each foreground blob and identify the top 2-3 local maxima, which must meet a minimum separation requirement. Within each maximum, seed the point closest to the centroid of all points belonging to that maximum. To grow the region, set each seed as the center, then scan its 8 neighbors and qualify or disqualify each one for the group based on its z \cos(\theta) distance. Then, with each qualified neighbor as the new center, repeat the same 8-neighbor scan to expand the group. The result of an example is given in the figure below.


Region Growing Algorithm in People Counting [1].
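
The growing step can be sketched as follows in plain C++. This is an illustration under stated assumptions rather than the method from [1]: the input holds the z \cos(\theta) distance per pixel with values at or below zero meaning “not foreground”, seeds are given as (row, column) pairs, and a neighbor qualifies when its distance is within `tolerance` of the seed's distance. `growRegions`, `DepthMap`, `LabelMap` and `tolerance` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using DepthMap = std::vector<std::vector<float>>;  // z*cos(theta); <= 0 means not foreground
using LabelMap = std::vector<std::vector<int>>;    // 0 = unlabeled, k = region of seed k-1

LabelMap growRegions(const DepthMap &dist,
                     const std::vector<std::pair<int, int>> &seeds,
                     float tolerance)
{
    assert(!dist.empty());
    const int rows = static_cast<int>(dist.size());
    const int cols = static_cast<int>(dist[0].size());
    LabelMap labels(rows, std::vector<int>(cols, 0));

    for (std::size_t s = 0; s < seeds.size(); ++s) {
        const int label = static_cast<int>(s) + 1;
        const int sr = seeds[s].first, sc = seeds[s].second;
        if (labels[sr][sc] != 0) continue;        // already claimed by an earlier seed
        const float seedDist = dist[sr][sc];
        labels[sr][sc] = label;

        std::vector<std::pair<int, int>> frontier(1, seeds[s]);
        while (!frontier.empty()) {
            const std::pair<int, int> center = frontier.back();
            frontier.pop_back();
            // Scan the 8 neighbors of the current center.
            for (int dr = -1; dr <= 1; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    const int r = center.first + dr, c = center.second + dc;
                    if (r < 0 || r >= rows || c < 0 || c >= cols) continue;
                    if (labels[r][c] != 0 || dist[r][c] <= 0.0f) continue;
                    if (std::fabs(dist[r][c] - seedDist) <= tolerance) {
                        labels[r][c] = label;
                        frontier.push_back(std::make_pair(r, c));
                    }
                }
            }
        }
    }
    return labels;
}
```

Comparing against the seed's distance rather than the neighbor's keeps a region from creeping across a gradual depth ramp into another person, at the cost of needing a tolerance wide enough to cover one body.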


Tracking

In some applications, tracking the movement of people in a room is important, for example when monitoring for suspicious or unusual activities, or when quantifying the interest of a crowd in particular products or showcases. Tracking also enables one to maintain a proper head count in situations where people may be partially or even fully occluded. In these scenarios, if the tracker has not detected any “people” leaving the scene from the sides of the camera view, then any disappearing blobs must be due to occlusion, and the head count must remain unchanged. Tracking requires matching foreground entities in consecutive frames. The matching can be based on multiple criteria, such as shortest centroid displacement and similarity of contour shape and intensity profile. Subtraction of consecutive frames also gives a good indication of the direction of motion, enabling prediction of where the tracked object will be in the new frame.
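
As a minimal sketch of matching by shortest centroid displacement, the plain C++ function below pairs each tracked centroid from the previous frame with the nearest unclaimed detection in the current frame. It is a greedy illustration only; `Centroid`, `matchByCentroid` and `maxDist` are hypothetical names, and shape or intensity similarity is not considered.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Centroid { float x, y; };

// For each previous centroid, return the index of the closest unclaimed
// current centroid within maxDist, or -1 if there is none.
std::vector<int> matchByCentroid(const std::vector<Centroid> &previous,
                                 const std::vector<Centroid> &current,
                                 float maxDist)
{
    assert(maxDist >= 0.0f);
    std::vector<int> match(previous.size(), -1);
    std::vector<bool> taken(current.size(), false);

    for (std::size_t i = 0; i < previous.size(); ++i) {
        float best = maxDist;
        int bestJ = -1;
        for (std::size_t j = 0; j < current.size(); ++j) {
            if (taken[j]) continue;
            const float d = std::hypot(current[j].x - previous[i].x,
                                       current[j].y - previous[i].y);
            if (d <= best) {
                best = d;
                bestJ = static_cast<int>(j);
            }
        }
        if (bestJ >= 0) {
            match[i] = bestJ;
            taken[bestJ] = true;
        }
    }
    return match;
}
```

A result of -1 for a track that was not near the edge of the view can be treated as an occlusion, so the head count stays unchanged until the person reappears or is seen leaving.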

Simple Code Example

The code snippet below illustrates the people tracking and counting described above. The #if macro selects between contour tracking and blob tracking.

void Horus::update(Frame *frame)
{
   vector< vector<cv::Point> > contours;
   vector<Vec4i> hierarchy;
   RNG rng(12345);

   if (getFrameType() == DepthCamera::FRAME_XYZI_POINT_CLOUD_FRAME) {

      // Create amplitude and depth Mat
      vector<float> zMap, iMap;
      XYZIPointCloudFrame *frm = dynamic_cast<XYZIPointCloudFrame *>(frame);
      for (int i=0; i< frm->points.size(); i++) {
         // Assumes each point exposes z (depth) and i (amplitude) fields
         zMap.push_back(frm->points[i].z);
         iMap.push_back(frm->points[i].i);
      }
      // clone() so each Mat owns its data after zMap/iMap go out of scope
      _iMat = Mat(getDim().height, getDim().width, CV_32FC1, iMap.data()).clone();
      _dMat = Mat(getDim().height, getDim().width, CV_32FC1, zMap.data()).clone();

      // Apply amplitude gain
      _iMat = (float)_ampGain*_iMat;

      // Update background as required
      if (!_setBackground) {
         // Store the current depth frame as the background reference here
         _setBackground = true;
         cout << endl << "Updated background" << endl;
      }

      // Find foreground by subtraction and convert to binary
      // image based on amplitude and depth thresholds
      Mat fMat = clipBackground((float)_depthThresh/100.0, (float)_ampThresh/100.0);

      // Apply morphological open to clean up image
      fMat.convertTo(_bMat, CV_8U, 255.0);
      Mat morphMat = _bMat.clone();
      Mat element = getStructuringElement( MORPH_RECT, Size(5,5), cv::Point(1,1) );
      morphologyEx(_bMat, morphMat, MORPH_OPEN, element);

      // Draw contours that meet a "person" requirement
      Mat drawing = Mat::zeros( _iMat.size(), CV_8UC3 );
      Mat im_with_keypoints = Mat::zeros( _iMat.size(), CV_8UC3 );
      cvtColor(_iMat, drawing, COLOR_GRAY2RGB);

      int peopleCount = 0;

#if 1
      // Find all contours
      findContours(morphMat, contours, hierarchy, RETR_TREE, CHAIN_APPROX_SIMPLE, cv::Point(0,0));
      for ( int i = 0; i < contours.size(); i++ ) {
         if (isPerson(contours[i], _dMat)) {
            peopleCount++;
            drawContours( drawing, contours, i, Scalar(0, 0, 255), 2, 8, vector<Vec4i>(), 0, cv::Point() );
         }
      }
#else
      // Find blobs
      std::vector<KeyPoint> keypoints;
      SimpleBlobDetector::Params params;

      // Filter by color
      params.filterByColor = true;
      params.blobColor = 255;

      // Change thresholds - depth
      params.minThreshold = 0;
      params.maxThreshold = 1000;

      // Filter by Area.
      params.filterByArea = true;
      params.minArea = 100;
      params.maxArea = 100000;

      // Filter by Circularity
      params.filterByCircularity = false;
      params.minCircularity = 0.1;
      // Filter by Convexity
      params.filterByConvexity = false;
      params.minConvexity = 0.87;
      // Filter by Inertia
      params.filterByInertia = false;
      params.minInertiaRatio = 0.01;

      cv::Ptr<SimpleBlobDetector> detector = cv::SimpleBlobDetector::create(params);
      detector->detect( morphMat, keypoints );

      cout << "Keypoints # " << keypoints.size() << endl;

      for ( int i = 0; i < keypoints.size(); i++ ) {
         cv::circle( drawing, cv::Point(keypoints[i].pt.x, keypoints[i].pt.y), 10, Scalar(0,0,255), 4 );
      }
      peopleCount = keypoints.size();
#endif

      putText(drawing, "Count = "+to_string(peopleCount), cv::Point(200, 50), FONT_HERSHEY_PLAIN, 1, Scalar(255, 255, 255));

      imshow("Binary", _bMat);
      imshow("Amplitude", _iMat);
      imshow("Draw", drawing);
      imshow("Morph", morphMat);
   }
}

Below is a video of people tracking using contours:


References

  1. Method for Segmentation of Articulated Structures Using Depth Images for Public Displays

28 thoughts on “People Tracking using a Depth Camera”

  1. Debasish Mitra

    Hello Larry,

    Hope you are having a good day. Are you performing background subtraction on the infrared image or the RGB image? Many special cameras provide an infrared image separately, along with depth and RGB camera data.

    I have found background subtraction on the visible spectrum to be highly susceptible to lighting changes. Does doing background subtraction on infrared improve its accuracy?

    • Debasish Mitra

      OK got it. Usually depth cameras combine an infrared laser with capture in both color and infrared to create the depth map, and give access to the infrared image, the color image and the depth feed.

  2. Debasish Mitra

    Hello Larry. It looks like you have also worked in deep learning fields like computer vision. Can you give an idea of which solution for people counting will be more accurate, especially one that works best even in cases of occlusion: a front dual-camera depth-based solution (like the one proposed here) or single-camera deep-learning-based people counting?

    • @Debash Deep learning is good at finding a desired target in an image with some noise. Partial occlusion may be solvable by a DNN with reduced confidence. Momentary full occlusion will require a temporal DNN model like LSTM. Persistent full occlusion will require some form of belief model to maintain object permanence.

    • Debasish Mitra

      I understand that DNN based models generally handle occlusion pretty well but what about accuracy when compared to such depth based sensor approach? Any personal experience?

    • Debasish Mitra

      Actually I have developed the depth sensing based people counting as suggested by this post and I have yet to test the accuracy and ability of such a system at different real world environment. I was thinking about implementing a DNN based people counting too and merging the results but doubt about the increase in accuracy.

  3. Debasish Mitra

    I have performed background subtraction and thresholding to obtain a binary image (pixel values either 0 or 255). Also, from the stereo image I have depth information, scaled from 0 to 255 max. I have multiplied these two images pixel-wise, so the result has only foreground blobs whose pixel values carry depth info. Now I can go ahead with your blob approach and count people through the region growing approach. I wanted to know whether the contour-based approach would work on occluded people. Also, given that the depth of the same person will vary slightly throughout that person’s pixel region, would the OpenCV contour operation handle it implicitly?

    • There are probably several ways to deal with occlusion. One way is seeding at the centroid of a blob, and growing the blob by including neighboring pixels if their depth does not vary too much (a smooth-surface criterion).

    • Debasish Mitra

      Can you add the isPerson() code, showing how you classified whether a contour is a person or not? I believe you tested minimum area, but what about convexity values and other params? Would this classification alone work fine for items like shopping trolleys/carts? Also, how did you, or how would you, handle shadows? Just depend on the background subtraction algorithm to detect and remove them?

    • isPerson() is frankly arbitrary. A simple way is to test for a minimum aspect ratio (standing person) and convexity (at least so many convex hull points or convexity defects). Or you can try training a DNN to do the recognition. There is no one way to do this.

  4. Arun

    The article was really nice; it gave me an overview of how to start my exploration in this field. In fact I am new to 3D-related analytics; I have done analytics with 2D cameras. If you can share any source of information related to this, it would be much appreciated. Thanks!!!

  5. Jafar

    Thank you very much man,

    I have run the code. Which window is the depth map? I was wondering How to transform the code into a reduced one that only serves to extract the raw depth map? I kinda have poor c++ experience. would you please give me some clue?

  6. Jafar

    Yes, I’m all confused. Would you please share with me the code for getting depth images to be displayed by openCV?

  7. Jafar

    Hey man, I thought you could help me out with my project. I’m also working on an algorithm that makes use of the OPT8241 camera. However, when extracted with OpenCV, the depth map seems to be extremely poor, especially in the case of moving objects.

    I was wondering if you can give me some advice or tips on how to get quality depth data from this sensor. Maybe some specific sensor configuration is required, or something is wrong with my code where the data is transferred from VoxelSDK into OpenCV.

    I’d much appreciate your help,

    • Can you attach some images you’re getting? Generally speaking, increasing the illumination power and the integration duty cycle will improve the depth image quality. How far away are you trying to range?

    • Jafar

      Thanks a lot for the reply,
      unfortunately for the moment, I have no chance to get you an example image, but I’ll attach my own analysis of the sensor, which was written a while ago, below:
      “The depth values undergo large variations from frame to frame for a static scene and show huge instability. The standard deviation (std) of some random points was calculated to determine the stability of the depth map delivered by the sensor. The calculations were performed for multiple capturing runs, each consisting of a given number of frames. All the points were on a plane 3 meters away from the sensor. From this std, we can conclude that the sensor maintains pretty stable performance only for a small area right under its illumination unit, and the data stability declines as the points move toward the frame borders. In simpler terms, the line going from the sensor to each point forms an angle relative to the perpendicular line from the sensor to the ground (the plane), and as this angle increases, the stability gets worse. The sensor cannot capture motion. As a result, a moving object is not detected unless it remains stationary.”

  8. Joseph Lin

    Hi, Larry.
    I am new to OPT8241-CDK-EVM.
    I don’t understand how to get or show frames from it.
    I’ve built the VoxelSDK and the example code on Ubuntu but just got binary files while running it.
    Could you tell me how you do that? Thank you.
