- Motivation
- Overview
- YOLO / Object Detection
- Resolving bounding-boxes via NMS
- Human Pose Estimation
- Human Body Models
- Research Papers & Model Selection Criteria
- 2D Pose Estimation
- 2 Keypoint Datasets: MPII and COCO
- Feeding our Image to our 2D HPE Model
- Dealing with Video
- Exploring Optical Flow
- Smoothing Using Optical Flow
- 3D Pose Estimation
- A Diversion: Exploring 6D Pose Estimation
- Skeleton Posture-Motion Feature Action Maps
- Action Classification and Recognition
- Unpleasant Decisions
- Final Thoughts
- Ciao! 👋
Motivation
I used to play tennis every day in high school. Then I stopped playing for a while, and found myself trying to start again one summer, hitting against a tennis ball machine. I knew my technique had gotten pretty bad, but it was hard to improve without any direct feedback.
So, I decided to see what could be done about automating feedback using computer vision. Existing solutions at the time, such as Dartfish, were designed around lowering the friction of applying human (e.g. a coach) analysis to video (slow-mo, draw tools, storage/access, etc.).
The ultimate goal was to be able to input a self-recorded video of me playing tennis and get specific feedback (follow-through, point-of-contact, spin) on a tennis stroke, so that I could improve while playing on my own.
I can’t say that I completely succeeded, but the process was a fun dive into computer vision.
Overview
This post will be focused on breaking down the core development and structure of the implemented computer vision pipeline. I’ll also include approaches I considered but ultimately excluded. Specifics of productionizing and subsequent challenges (dataset labeling, training, model porting, product development) are generally out of scope for this post.
Here’s a high-level overview of the pipeline:
Note: Images presented, unless otherwise noted, are from the Kinetics-700 Human Action Dataset, a dataset of video clips of different human action classes (sourced via YouTube).
YOLO / Object Detection
Starting out, given a user video (assuming trimmed to only frames representing a given stroke), the first step in my mind was to use some kind of object detection method to identify and generate a bounding-box for the relevant tennis player.
This was part of the larger goal of aiming for a viewpoint-invariant approach; we didn’t want to assume players being at certain scales or orientations relative to the camera. With a bounding-box for a detected player, we can also reduce downstream input size/memory footprint (image dimensions).
Aside: Other products actually guide users to record from a specific position/orientation on a tennis court.
I ultimately selected the YOLOv3 model, due to its high accuracy, performance, easy scaling and portability (aiming for on-device detection, example repo). In this case, only one classification is relevant: “Person”.
Note: If using YOLOv3 for single-class detection, you can train the model on one class explicitly, rather than just filtering output to a single class: SINGLE-CLASS TRAINING EXAMPLE
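If you go the filtering route instead, it's just a matter of keeping detections whose class is "Person". A minimal sketch, assuming detections arrive as `(class_id, confidence, box)` tuples and that class id 0 is "person" (true for COCO-trained YOLOv3); the names and threshold here are illustrative:

```python
# Keep only "person" detections from a generic detector's output.
# Assumes each detection is (class_id, confidence, (x, y, w, h)) and
# that class id 0 corresponds to "person" (as in COCO-trained YOLO).
def filter_person_detections(detections, person_class_id=0, min_conf=0.5):
    return [d for d in detections
            if d[0] == person_class_id and d[1] >= min_conf]

detections = [
    (0, 0.91, (40, 30, 120, 260)),   # person
    (32, 0.88, (200, 50, 30, 30)),   # sports ball
    (0, 0.35, (300, 40, 90, 200)),   # low-confidence person
]
print(filter_person_detections(detections))
```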
An example of a bounding-box from YOLOv3:
Resolving bounding-boxes via NMS
The only bit of YOLOv3's internals I'm going to touch on is a common technique used by object detectors to finalize detections: Non-Maximum Suppression (NMS). This will show up again later.
Essentially, NMS is used to filter multiple overlapping detections (bounding-boxes with confidence scores for object classes) and select the most relevant bounding-box for a given detection. The procedure uses detection confidence scores and a metric called Intersection over Union (IoU: the intersection of two boxes divided by their union) to select a final detection. A good overview can be found at the start of this article.
Non-Max Suppression
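To make the procedure concrete, here's a minimal greedy NMS sketch (boxes as `(x1, y1, x2, y2)` tuples; the threshold is illustrative, not what YOLOv3 ships with):

```python
def iou(a, b):
    # Intersection over Union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # Greedily keep the highest-scoring box, then drop any remaining
    # box that overlaps it too much; repeat with what's left.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # -> [0, 2]
```

The two overlapping boxes collapse to the higher-scoring one; the distant box survives untouched.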
Human Pose Estimation
Ok, so we have a bounding-box around some person (tennis player). Let’s look into getting a pose estimate from it.
Human pose estimation: sometimes also called keypoint detection, typically refers to detecting locations of human keypoints (elbow, wrist, knees, etc.) from an image.
Our aim: generate a 3D human pose estimate for a detected person across all the frames in a video. It’s important for the estimation to be three-dimensional, since otherwise we won’t be able to give meaningful feedback, e.g. point-of-contact with ball on a tennis stroke, backswing depth, etc.
A common approach for 3D human pose estimation (HPE) is to treat the problem as a multistage inference task:
- Estimate 2D poses
- Reconstruct 3D poses from 2D poses
- Optionally link 3D poses over time/frames (multi-person pose estimation, we’ll largely ignore this)
There are also approaches based on unifying these stages, arguing that:
“Solving each step in isolation is sub-optimal and prone to errors that cannot be recovered in later stages… [particularly]... for monocular methods”
- TesseTrack: End-to-End Learnable Multi-Person Articulated 3D Pose Tracking.
In hindsight, I think they’re generally correct, with the caveat that separate stages allow for easier iteration and observability.
So I’ll start by breaking down the approach taken regarding 2D HPE, before proceeding to 3D HPE. But first, let’s establish a little more context regarding the problem and some decision-making criteria.
Human Body Models
Within the broad domain of HPE, there are different kinds of human body models, including volumetric, planar, and kinematic models. This post is exclusively focused on kinematic models (stick figures) for both 2D and 3D HPE; they're the dominant models in research, largely because they simplify the problem space and admit simpler loss metrics.
I’ll be describing HPE tasks in the context of detecting the joints, or keypoints, of kinematic models.
Different human body models - Source
Research Papers & Model Selection Criteria
I won't go over all the papers I read/tested, but the main criteria in my selection process were:
- Licensing: some work requires royalties for commercial use ( openpose - $25,000 USD annual royalty ). I wanted maximum flexibility for the future.
- Code & weight accessibility: how long would it take me to get a model up and running? (training + GPU instances = $$)
- Portability: how easy would it be to port via something like ONNX to CoreML (iOS)?
- Accuracy/performance
2D Pose Estimation
For 2D HPE, I ultimately selected an architecture introduced by Microsoft researchers: the High-Resolution Net (HRNet).
The Microsoft Research Blog describes some advantages: “In human pose estimation, HRNet gets superior estimation score with much lower training and inference memory cost and slightly larger training time cost and inference time cost.”
Some other advantages include:
- Pretrained estimators available
- MIT-licensed code
- MPII and COCO versions (more flexibility for subsequent 3D HPE model)
2 Keypoint Datasets: MPII and COCO
Regarding that last bullet point: what do "MPII" and "COCO" refer to?
- COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset. It notably provides access to annotated human keypoint datasets for keypoint detection tasks.
- The MPII Human Pose dataset provides a dataset of images with annotated joints (keypoints).
They’re both datasets for benchmarking HPE approaches; they have some notable differences in how they model a 2D kinematic human skeleton, some of which are shown below:
| MPII | COCO | Present in both |
| --- | --- | --- |
| head top | | ❌ |
| right ankle | right ankle | ✅ |
| | nose | ❌ |
| left shoulder | left shoulder | ✅ |
| … | … | … |
I experimented with inferring some of the keypoints in one specification from those in the other (e.g. estimating COCO["nose"] from MPII["head top", "upper neck"]). The results weren't great, but it gave me a hacky way to compare models across different datasets.
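The kind of cross-dataset mapping I mean amounts to interpolating along the upper-neck-to-head-top segment. A sketch, where the 1/3 fraction is my own hand-tuned guess, not part of either specification:

```python
import numpy as np

def estimate_nose(head_top, upper_neck, frac=1/3):
    # Rough guess: place a COCO-style "nose" a fraction of the way up
    # the MPII upper-neck -> head-top segment. frac is hand-tuned,
    # not canonical to either dataset.
    head_top = np.asarray(head_top, dtype=np.float64)
    upper_neck = np.asarray(upper_neck, dtype=np.float64)
    return upper_neck + frac * (head_top - upper_neck)

# With the neck at (100, 100) and head top at (100, 40) in pixel coords:
print(estimate_nose((100, 40), (100, 100)))  # -> [100.  80.]
```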
Feeding our Image to our 2D HPE Model
Now we're ready to actually feed our YOLOv3-generated bounding-box to our 2D HPE model (HRNet) and start seeing 2D pose estimates.
Before that though, we need some understanding of the pre-processing expected by the HRNet model so we can both input video frames (+ computed bounding-boxes) and map keypoint inferences back to our original image dimensions.
The most relevant part is resizing input images to the 256x256 shape expected by the HRNet model, which we can do via an affine transformation (affine transformations preserve collinearity and parallelism).
Here’s an example:
Affine Transformation (1280x654 —> 256x256 )
In code:
This is implemented using OpenCV's `getAffineTransform(dst, src)`, where `dst` and `src` are points forming right triangles covering the upper-left quadrant (hypotenuse from top-left to center) of the destination and source images, respectively.
Later, we can use `invertAffineTransform()` alongside the matrix initially computed by `getAffineTransform(dst, src)` to compute an inverse affine transformation matrix. This inverse matrix can then be used to map our keypoint detections back to our original coordinate space:
Here’s an initial 2D keypoint estimate, with labelled keypoints:
You can tell this is an MPII kinematic skeleton since it has a keypoint for “head top”
Aside: At this stage, I did have to go back and adjust the dimensions of the YOLOv3 bounding-boxes, as performance on later stages was brittle with respect to image dimensions. In the end, this was mostly just a case of original detections being a bit too tight.
Left - tight detection bounding-box; Right - enlarged detection bounding-box (note differences in lower-body keypoint estimation)
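The fix was just padding the detection box by a margin before cropping; a sketch (the 15% margin is illustrative, not the value I settled on):

```python
def enlarge_box(box, img_w, img_h, margin=0.15):
    # box: (x1, y1, x2, y2). Pad each side by `margin` of the box's
    # width/height, clamped to the image bounds.
    x1, y1, x2, y2 = box
    pad_x, pad_y = (x2 - x1) * margin, (y2 - y1) * margin
    return (max(0, x1 - pad_x), max(0, y1 - pad_y),
            min(img_w, x2 + pad_x), min(img_h, y2 + pad_y))

print(enlarge_box((100, 100, 200, 300), 1280, 654))  # -> (85.0, 70.0, 215.0, 330.0)
```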
Dealing with Video
There’s an important distinction that I’ve largely ignored so far: the referenced models are designed for HPE on single images, while our context calls for video analysis. The addition of a temporal dimension results in both new challenges and opportunities for improvement.
Challenges:
- Noise (inconsistent frame-to-frame estimation changes, jittering)
- Missing detections (both in YOLOv3 and HPE models)
However, now that we’ve acknowledged a temporal dimension, new tools become available. I’ll quickly introduce one of them, Optical Flow, before explaining how we can integrate it into our frame-to-frame analysis to improve our keypoint detections.
Exploring Optical Flow
Optical flow is conceptually simple: it’s the motion of pixels between frames. There are some assumptions involved (pixel intensities are invariant across video frames, neighboring pixels exhibit similar movement), but essentially you can think about a measurement of optical flow as an answer to the question: “Given a pixel somewhere in frame x of video y, where has that pixel moved to in frame x+1?” It’s not perfect, but optical flow can serve as a fast, cheap approximation for expected movement. We’ll be using the Lucas–Kanade method to estimate optical flow.
Optical flow:
Optical Flow - Source
Our implementation will once again use OpenCV:
Smoothing Using Optical Flow
To deal with frame-to-frame inconsistency, we’ll use an algorithm derived from one published in Simple Baselines for Human Pose Estimation and Tracking Section 3.3 (“The Flow-based Pose Tracking Algorithm”).
The idea is to use Non-Maximum Suppression to unify bounding-boxes generated by optical flow (propagating HPE keypoints from an earlier frame) and an object detector (the YOLOv3 box on the current frame). This provides a number of benefits in a video context, such as counteracting missing frame-specific detections, smoothing out pose estimates across frames, and preventing a lot of awkward issues (e.g. detecting different people in different frames).
Here’s the relevant bit of the procedure from the paper:
And here’s a Python implementation with the tedious bits stripped out:
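A rough reconstruction of the shape of that implementation, with the tedious bits stripped (names are mine; pass in whatever suppression routine you use as `nms_fn`):

```python
import numpy as np

def box_from_keypoints(keypoints, margin=0.15):
    # Bounding box around flow-propagated keypoints, padded by a margin
    # (margin value is illustrative).
    kp = np.asarray(keypoints, dtype=np.float64)
    x1, y1 = kp.min(axis=0)
    x2, y2 = kp.max(axis=0)
    pad_x, pad_y = (x2 - x1) * margin, (y2 - y1) * margin
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)

def unified_boxes(detector_boxes, detector_scores,
                  prev_keypoints, flow_score, nms_fn):
    # Pool the detector's boxes with a box propagated from the previous
    # frame's keypoints via optical flow, then let NMS pick survivors.
    boxes = list(detector_boxes) + [box_from_keypoints(prev_keypoints)]
    scores = list(detector_scores) + [flow_score]
    keep = nms_fn(boxes, scores)
    return [boxes[i] for i in keep]
```

When the detector misses a frame entirely, `detector_boxes` is empty and the flow-propagated box carries the detection through on its own.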
Cool, now that we have some tolerable 2D HPE keypoints and bounding-boxes, let’s move to 3D HPE.
3D Pose Estimation
For 3D HPE, I ended up using the model presented in "Semantic Graph Convolutional Networks for 3D Human Pose Regression." Since it's designed for the task of 2D-to-3D human pose regression, we'll be feeding our 2D detections from the previous stage in as input.
3D Pose Regression from 2D Pose Estimate
Notably, since this model is meant for single images, it’s unable to leverage the temporal dimension available in a sequential context (video).
In hindsight, I really should have made more fundamental changes in order to incorporate temporal information earlier, since here I once again had to invest significant time to solve many of the same problems encountered earlier with 2D HPE.
I tried some creative approaches before ultimately settling on a basic sliding-window procedure over frame sequences to smooth noise out. There’s a little bit of caution needed here: tennis is full of quick, explosive movements and we don’t want to smooth out relevant signals.
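The sliding-window smoothing is just a moving average over each joint coordinate across frames; a sketch (the window size is a judgment call, kept small here precisely to avoid flattening fast swings):

```python
import numpy as np

def smooth_poses(poses, window=3):
    # poses: (frames, joints, 3) array of 3D keypoints.
    # Centered moving average over the time axis; edge frames
    # average over whatever neighbors are available.
    poses = np.asarray(poses, dtype=np.float64)
    out = np.empty_like(poses)
    half = window // 2
    for t in range(len(poses)):
        lo, hi = max(0, t - half), min(len(poses), t + half + 1)
        out[t] = poses[lo:hi].mean(axis=0)
    return out

# One joint jittering up and back down across three frames:
noisy = np.array([[[0, 0, 0]], [[3, 3, 3]], [[0, 0, 0]]], dtype=float)
print(smooth_poses(noisy)[1])  # middle frame pulled toward its neighbors
```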
Since many of the challenges here are essentially duplicated from those covered in the 2D HPE video-processing stage, I’ll cut the recounting short.
Still, we finally have 3D HPE keypoint coordinates! Now we need to convert these coordinates into usable inputs for action classification, so that we can start to classify a given video of a stroke (forehand, serve, volley).
First, though, a quick tangent:
A Diversion: Exploring 6D Pose Estimation
Although I didn’t end up using it, I considered replacing elements of human pose estimation with object pose estimation (tennis rackets, in this case) in an effort to simplify the overall process. Object pose estimation in this context would imply estimating the position and orientation of someone’s tennis racket.
The “6D” aspect of “6D pose estimation” refers to the idea of estimating both an object’s position and orientation in a 3D context. I was specifically intrigued by a paper I saw, “Augmented Autoencoders: Implicit 3D Orientation Learning for 6D Object Detection,” for two reasons:
- It described a process to generate synthetic training data via simulating a given 3D object model (e.g. a CAD model) in a large variety of noisy images in different orientations and positions. Not having to worry about acquiring and labeling a dataset was a big potential win.
- I’d used Denoising Autoencoders before to impute missing values (not a great reason).
Accuracy was ultimately too poor (not surprising, given the synthetic data generation). Here's a bit of what the training process looked like:
3D CAD model of a tennis racket to use for generating a training dataset
Synthetic data generated for training, left - 3D model in noisy image, right - ground-truth position & orientation
Ok, let’s get back to turning our 3D HPE outputs into something useful.
Skeleton Posture-Motion Feature Action Maps
Right, so we want to normalize our 3D HPE sequences a bit, so that our action classification framework / feature extraction doesn't need to deal with varying sequence lengths (sequences spanning different numbers of frames).
We could try padding/compressing frame-sequences in order to normalize input lengths, but there’s a big risk of introducing some form of distortion.
In the process of searching for a solution, I stumbled onto this paper: Spatio–Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks, which provided a pretty neat approach: encoding a sequence of 3D poses into a single RGB image (a Skeleton Posture-Motion Feature action map; SPMF) capturing both temporal and spatial dimensions. Incidentally, this also reduced the dependence on my custom smoothing procedure, as the proposed encoding scheme applies its own smoothing function to the predictions.
Generation of an SPMF, Source
An example SPMF
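To give a flavor of the encoding (a simplification of my own, not the paper's exact SPMF construction): one column per frame, one row per joint, with normalized x/y/z coordinates mapped to the R/G/B channels.

```python
import numpy as np

def pose_sequence_to_image(poses):
    # poses: (frames, joints, 3) 3D keypoints.
    # Produces a (joints, frames, 3) uint8 image: column = frame,
    # row = joint, channels = normalized x/y/z coordinates.
    poses = np.asarray(poses, dtype=np.float64)
    lo, hi = poses.min(), poses.max()
    norm = (poses - lo) / (hi - lo + 1e-8)  # scale into [0, 1]
    return (norm.transpose(1, 0, 2) * 255).astype(np.uint8)

seq = np.random.rand(60, 16, 3)   # 60 frames, 16 joints
img = pose_sequence_to_image(seq)
print(img.shape)  # (16, 60, 3): a fixed-size image per sequence
```

The key property is that any sequence length collapses into a single fixed-interpretation image, which is what makes downstream CNN classification straightforward.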
Alright, now let’s explore trying to get some useful data out of this magic image.
Action Classification and Recognition
My plan to build out action classification was to use Deep Convolutional Neural Networks based on the DenseNet architecture, essentially the approach taken by the earlier SPMF paper. From there, I wanted to apply some form of feature extraction procedure to extract specific stroke-level features, such as: point-of-contact, follow-through, and spin. With these features I hoped it would be possible to give some form of useful feedback.
However, there was a big difference in this phase compared with all earlier stages: no datasets! There were no annotated, publicly available tennis video datasets with stroke-level features (point-of-contact, follow-through, etc.). Moreover, even datasets that did include relevant videos, such as Kinetics-700, only had a limited number of them (1,144 clips labeled "playing tennis" in Kinetics-700).
I also checked out some other datasets:
- Olympic Sports Dataset
- Only Tennis Serves
- UCF101 - Action Recognition Data Set
- Tennis swings
- HACS: Human Action Clips and Segments Dataset
- No luck
- The YouTube Sports-1M Dataset
- Has tennis label
- Sports Videos in the Wild (SVW): A Video Dataset for Sports Analysis
- Has tennis label
So, now I was in a situation where accuracy for non-trivial features was essentially terrible.
Of course, I had known that accessible data would grow sparser as the relevant domain narrowed, but I massively underestimated the investment implied by acquiring and labelling a useful training dataset.
Trivial features which were pretty easy to extract:
- Right-side
- Left-side
- Two-Hand
- One-Hand
- Follow-Through over Shoulder
Non-trivial features which I struggled to achieve reasonable accuracy with:
- Topspin
- Hitting up on the ball
- Volley
- Swinging Volley
Unpleasant Decisions
At this point, I had to make some decisions. Accuracy and feature extraction were proving to be really difficult tasks without access to deep pre-existing datasets.
I had to decide between:
- Dedicating a big chunk of time to create a truly useful product and hoping to find product-market fit.
- I couldn’t justify working on this for a year only for myself, so if it kept going I needed to try to make it useful for others.
- Letting this go and leaving it as a fun experiment.
Ultimately, I chose to leave this work behind, for the following reasons:
- I lacked experience coaching others and didn’t have enough exposure to a diverse set of coaches to be confident in my understanding of providing meaningful, generalized tennis feedback.
- Also individual differences in “correct” strokes implied that any feedback would likely need to be comparison-based.
- The data labeling/annotation problem implied a huge upfront cost (at one point I was trading Python lessons for my friends in exchange for data labeling!); I noticed competitors hiring for full-time data labeling roles, suggesting that this wasn’t a solved problem (they’re still hiring for those roles!).
- Lack of conviction in the market opportunity; talking to club players, coaches, and doing general market exploration gave me many reasons to take pause. My impression was also that there was a big long-tail of features needed to achieve perceived parity with existing solutions.
- The higher end of the market is also exposed to more high-margin solutions, e.g. multi-camera, fixed-position HD recording setups with costs running up to 5 figures.
- My cofounder, who I brought on towards the tail-end of this work, also challenged my thinking and forced me to question some of my assumptions. We ended up pivoting to building some other stuff together.
Final Thoughts
Even though my overall approach was crude and wasteful (there are now also mobile APIs for pose estimation), it was a lot of fun to dive into a challenge like this; reading research papers isn't boring, at least when there's something actionable you're looking to extract.
Some thoughts:
- Got invited to some startup accelerators and was offered the opportunity to white-label my framework, but I didn't want to undertake long-term commitments, particularly alone.
- Got to talk to a lot of interesting people in the process of building this, mostly via user interviews, so that was pretty cool.
- Gained a new dislike for big Python projects that use numpy and throw around a bunch of magic numbers in big blocks of uncommented code.
- On the other hand, working out affine transforms on my whiteboard was kind of fun.
- From experience: if you do similar work, don’t carelessly forget to turn off a p2.xlarge instance for an extended period of time.
Lastly, if you want to see what automated feedback with pose estimation can look like when executed well check out SevenSix and Onyx.