- Motivation
- Overview
- YOLO / Object Detection
- Resolving bounding-boxes via NMS
- Human Pose Estimation
- Human Body Models
- Research Papers & Model Selection Criteria
- 2D Pose Estimation
- 2 Keypoint Datasets: MPII and COCO
- Feeding our Image to our 2D HPE Model
- Dealing with Video
- Exploring Optical Flow
- Smoothing Using Optical Flow
- 3D Pose Estimation
- A Diversion: Exploring 6D Pose Estimation
- Skeleton Posture-Motion Feature Action Maps
- Action Classification and Recognition
- Unpleasant Decisions
- Final Thoughts
- Ciao! 👋
Motivation
I used to play tennis every day in high school. Then I stopped playing for a while, and found myself trying to start again one summer, hitting against a tennis ball machine. I knew my technique had gotten pretty bad, but it was hard to improve without any direct feedback.
So, I decided to see what could be done about automating feedback using computer vision. Existing solutions at the time, such as Dartfish, were designed around lowering the friction of applying human (e.g. a coach) analysis to video (slow-mo, draw tools, storage/access, etc.).
The ultimate goal was to be able to input a self-recorded video of me playing tennis and get specific feedback (follow-through, point-of-contact, spin) on a tennis stroke, so that I could improve while playing on my own.
I can’t say that I completely succeeded, but the process was a fun dive into computer vision.
Overview
This post will be focused on breaking down the core development and structure of the implemented computer vision pipeline. I’ll also include approaches I considered but ultimately excluded. Specifics of productionizing and subsequent challenges (dataset labeling, training, model porting, product development) are generally out of scope for this post.
Here’s a high-level overview of the pipeline:
Note: Images presented, unless otherwise noted, are from the Kinetics-700 Human Action Dataset, a dataset of video clips of different human action classes (sourced via YouTube).
YOLO / Object Detection
Starting out, given a user video (assuming trimmed to only frames representing a given stroke), the first step in my mind was to use some kind of object detection method to identify and generate a bounding-box for the relevant tennis player.
This was part of the larger goal of aiming for a viewpoint-invariant approach; we didn’t want to assume players being at certain scales or orientations relative to the camera. With a bounding-box for a detected player, we can also reduce downstream input size/memory footprint (image dimensions).
Aside: Other products actually guide users to record from a specific position/orientation on a tennis court.
I ultimately selected the YOLOv3 model, due to its high accuracy, performance, easy scaling and portability (aiming for on-device detection, example repo). In this case, only one classification is relevant: “Person”.
Note: If using YOLOv3 for single-class detection, you can train the model on one class explicitly, rather than just filtering output to a single class: SINGLE-CLASS TRAINING EXAMPLE
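If you go the filtering route instead, it's just a matter of keeping detections whose class is "Person". A minimal sketch, assuming detections arrive as `(class_id, confidence, box)` tuples and that class id 0 is "person" (true for COCO-trained YOLOv3); the names and threshold here are illustrative:

```python
# Keep only "person" detections from a generic detector's output.
# Assumes each detection is (class_id, confidence, (x, y, w, h)) and
# that class id 0 corresponds to "person" (as in COCO-trained YOLO).
def filter_person_detections(detections, person_class_id=0, min_conf=0.5):
    return [d for d in detections
            if d[0] == person_class_id and d[1] >= min_conf]

detections = [
    (0, 0.91, (40, 30, 120, 260)),   # person
    (32, 0.88, (200, 50, 30, 30)),   # sports ball
    (0, 0.35, (300, 40, 90, 200)),   # low-confidence person
]
print(filter_person_detections(detections))
```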
An example of a bounding-box from YOLOv3:
Resolving bounding-boxes via NMS
The only bit of YOLOv3's internals I'm going to touch on is a common technique used by object detectors to finalize detections: Non-Maximum Suppression (NMS). This will show up again later.
Essentially, NMS is used to filter multiple overlapping detections (bounding-boxes with confidence scores for object classes) and select the most relevant bounding-box for a given detection. The procedure uses detection confidence scores and a metric called Intersection over Union (IoU: the intersection of two boxes divided by their union) to select a final detection. A good overview can be found at the start of this article.
Non-Max Suppression
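To make the procedure concrete, here's a minimal greedy NMS sketch (boxes as `(x1, y1, x2, y2)` tuples; the threshold is illustrative, not what YOLOv3 ships with):

```python
def iou(a, b):
    # Intersection over Union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # Greedily keep the highest-scoring box, then drop any remaining
    # box that overlaps it too much; repeat with what's left.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # -> [0, 2]
```

The two overlapping boxes collapse to the higher-scoring one; the distant box survives untouched.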
Human Pose Estimation
Ok, so we have a bounding-box around some person (tennis player). Let’s look into getting a pose estimate from it.
Human pose estimation: sometimes also called keypoint detection, typically refers to detecting locations of human keypoints (elbow, wrist, knees, etc.) from an image.
Our aim: generate a 3D human pose estimate for a detected person across all the frames in a video. It’s important for the estimation to be three-dimensional, since otherwise we won’t be able to give meaningful feedback, e.g. point-of-contact with ball on a tennis stroke, backswing depth, etc.
A common approach for 3D human pose estimation (HPE) is to treat the problem as a multistage inference task:
- Estimate 2D poses
- Reconstruct 3D poses from 2D poses
- Optionally link 3D poses over time/frames (multi-person pose estimation, we’ll largely ignore this)
There are also approaches based on unifying these stages, arguing that:
“Solving each step in isolation is sub-optimal and prone to errors that cannot be recovered in later stages… [particularly]... for monocular methods”
- TesseTrack: End-to-End Learnable Multi-Person Articulated 3D Pose Tracking.
In hindsight, I think they’re generally correct, with the caveat that separate stages allow for easier iteration and observability.
So I’ll start by breaking down the approach taken regarding 2D HPE, before proceeding to 3D HPE. But first, let’s establish a little more context regarding the problem and some decision-making criteria.
Human Body Models
Within the broad domain of HPE, there are different kinds of human body models, including volumetric, planar, and kinematic models. This post is exclusively focused on kinematic models (stick figures) for both 2D and 3D HPE; they're the dominant models in research, largely because they simplify the problem space and admit simpler loss metrics.
I’ll be describing HPE tasks in the context of detecting the joints, or keypoints, of kinematic models.
Different human body models - Source
Research Papers & Model Selection Criteria
I won't go over all the papers I read/tested, but the main criteria in my selection process were:
- Licensing: some work requires royalties for commercial use ( openpose - $25,000 USD annual royalty ). I wanted maximum flexibility for the future.
- Code & weight accessibility: how long would it take me to get a model up and running? (training + GPU instances = $$)
- Portability: how easy would it be to port via something like ONNX to CoreML (iOS)?
- Accuracy/performance
2D Pose Estimation
For 2D HPE, I ultimately selected an architecture introduced by Microsoft researchers: the High-Resolution Net (HRNet).
The Microsoft Research Blog describes some advantages: “In human pose estimation, HRNet gets superior estimation score with much lower training and inference memory cost and slightly larger training time cost and inference time cost.”
Some other advantages include:
- Pretrained estimators available
- MIT-licensed code
- MPII and COCO versions (more flexibility for subsequent 3D HPE model)
2 Keypoint Datasets: MPII and COCO
Regarding that last bullet point: what do "MPII" and "COCO" refer to?
- COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset. It notably provides access to annotated human keypoint datasets for keypoint detection tasks.
- The MPII Human Pose dataset provides a dataset of images with annotated joints (keypoints).
They’re both datasets for benchmarking HPE approaches; they have some notable differences in how they model a 2D kinematic human skeleton, some of which are shown below:
| MPII | COCO | Present in both |
| --- | --- | --- |
| head top | | ❌ |
| right ankle | right ankle | ✅ |
| | nose | ❌ |
| left shoulder | left shoulder | ✅ |
| … | … | … |
I experimented with inferring some of the keypoints in one specification from those in the other (e.g. estimating COCO["nose"] from MPII["head top", "upper neck"]). The results weren't great, but it gave me a hacky way to compare models across different datasets.
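The kind of cross-dataset mapping I mean amounts to interpolating along the upper-neck-to-head-top segment. A sketch, where the 1/3 fraction is my own hand-tuned guess, not part of either specification:

```python
import numpy as np

def estimate_nose(head_top, upper_neck, frac=1/3):
    # Rough guess: place a COCO-style "nose" a fraction of the way up
    # the MPII upper-neck -> head-top segment. frac is hand-tuned,
    # not canonical to either dataset.
    head_top = np.asarray(head_top, dtype=np.float64)
    upper_neck = np.asarray(upper_neck, dtype=np.float64)
    return upper_neck + frac * (head_top - upper_neck)

# With the neck at (100, 100) and head top at (100, 40) in pixel coords:
print(estimate_nose((100, 40), (100, 100)))  # -> [100.  80.]
```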
Feeding our Image to our 2D HPE Model
Now we're ready to actually feed our YOLOv3-generated bounding-box to our 2D HPE model (HRNet) and start seeing 2D pose estimates.
Before that though, we need some understanding of the pre-processing expected by the HRNet model so we can both input video frames (+ computed bounding-boxes) and map keypoint inferences back to our original image dimensions.
The most relevant part is resizing input images to the 256x256 shape expected by the HRNet model, which we can do via an affine transformation (affine transformations preserve collinearity and parallelism).
Here’s an example:
Affine Transformation (1280x654 —> 256x256 )
In code:
This is implemented using OpenCV's `getAffineTransform(dst, src)`, where `dst` and `src` are points forming right triangles covering the upper-left quadrant (hypotenuse from top-left to center) of the destination and source images, respectively.
Later, we can use `invertAffineTransform()` alongside the matrix initially computed by `getAffineTransform(dst, src)` to compute an inverse affine transformation matrix. This inverse matrix can then be used to map our keypoint detections back to our original coordinate space:
Here’s an initial 2D keypoint estimate, with labelled keypoints:
You can tell this is an MPII kinematic skeleton since it has a keypoint for “head top”
Aside: At this stage, I did have to go back and adjust the dimensions of the YOLOv3 bounding-boxes, as performance on later stages was brittle with respect to image dimensions. In the end, this was mostly just a case of original detections being a bit too tight.
Left - tight detection bounding-box; Right - enlarged detection bounding-box (note differences in lower-body keypoint estimation)
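The fix was just padding the detection box by a margin before cropping; a sketch (the 15% margin is illustrative, not the value I settled on):

```python
def enlarge_box(box, img_w, img_h, margin=0.15):
    # box: (x1, y1, x2, y2). Pad each side by `margin` of the box's
    # width/height, clamped to the image bounds.
    x1, y1, x2, y2 = box
    pad_x, pad_y = (x2 - x1) * margin, (y2 - y1) * margin
    return (max(0, x1 - pad_x), max(0, y1 - pad_y),
            min(img_w, x2 + pad_x), min(img_h, y2 + pad_y))

print(enlarge_box((100, 100, 200, 300), 1280, 654))  # -> (85.0, 70.0, 215.0, 330.0)
```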
Dealing with Video
There’s an important distinction that I’ve largely ignored so far: the referenced models are designed for HPE on single images, while our context calls for video analysis. The addition of a temporal dimension results in both new challenges and opportunities for improvement.
Challenges:
- Noise (inconsistent frame-to-frame estimation changes, jittering)
- Missing detections (both in YOLOv3 and HPE models)
However, now that we’ve acknowledged a temporal dimension, new tools become available. I’ll quickly introduce one of them, Optical Flow, before explaining how we can integrate it into our frame-to-frame analysis to improve our keypoint detections.
Exploring Optical Flow
Optical flow is conceptually simple: it’s the motion of pixels between frames. There are some assumptions involved (pixel intensities are invariant across video frames, neighboring pixels exhibit similar movement), but essentially you can think about a measurement of optical flow as an answer to the question: “Given a pixel somewhere in frame x of video y, where has that pixel moved to in frame x+1?” It’s not perfect, but optical flow can serve as a fast, cheap approximation for expected movement. We’ll be using the Lucas–Kanade method to estimate optical flow.
Optical flow:
Optical Flow - Source
Our implementation will once again use OpenCV:
Smoothing Using Optical Flow
To deal with frame-to-frame inconsistency, we’ll use an algorithm derived from one published in Simple Baselines for Human Pose Estimation and Tracking Section 3.3 (“The Flow-based Pose Tracking Algorithm”).
The idea is to use Non-Maximum Suppression to unify bounding-boxes generated by optical flow (propagating HPE keypoints from an earlier frame) and an object detector (the YOLOv3 box on the current frame). This provides a number of benefits in a video context, such as counteracting missing frame-specific detections, smoothing out pose estimates across frames, and preventing a lot of awkward issues (e.g. detecting different people in different frames).
Here’s the relevant bit of the procedure from the paper:
And here’s a Python implementation with the tedious bits stripped out:
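A rough reconstruction of the shape of that implementation, with the tedious bits stripped (names are mine; pass in whatever suppression routine you use as `nms_fn`):

```python
import numpy as np

def box_from_keypoints(keypoints, margin=0.15):
    # Bounding box around flow-propagated keypoints, padded by a margin
    # (margin value is illustrative).
    kp = np.asarray(keypoints, dtype=np.float64)
    x1, y1 = kp.min(axis=0)
    x2, y2 = kp.max(axis=0)
    pad_x, pad_y = (x2 - x1) * margin, (y2 - y1) * margin
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)

def unified_boxes(detector_boxes, detector_scores,
                  prev_keypoints, flow_score, nms_fn):
    # Pool the detector's boxes with a box propagated from the previous
    # frame's keypoints via optical flow, then let NMS pick survivors.
    boxes = list(detector_boxes) + [box_from_keypoints(prev_keypoints)]
    scores = list(detector_scores) + [flow_score]
    keep = nms_fn(boxes, scores)
    return [boxes[i] for i in keep]
```

When the detector misses a frame entirely, `detector_boxes` is empty and the flow-propagated box carries the detection through on its own.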
Cool, now that we have some tolerable 2D HPE keypoints and bounding-boxes, let’s move to 3D HPE.
3D Pose Estimation
For 3D HPE, I ended up using the model presented in "Semantic Graph Convolutional Networks for 3D Human Pose Regression." Since it's designed for the task of 2D-to-3D human pose regression, we'll be feeding our 2D detections from the previous stage in as input.
3D Pose Regression from 2D Pose Estimate
Notably, since this model is meant for single images, it’s unable to leverage the temporal dimension available in a sequential context (video).
In hindsight, I really should have made more fundamental changes in order to incorporate temporal information earlier, since here I once again had to invest significant time to solve many of the same problems encountered earlier with 2D HPE.
I tried some creative approaches before ultimately settling on a basic sliding-window procedure over frame sequences to smooth noise out. There’s a little bit of caution needed here: tennis is full of quick, explosive movements and we don’t want to smooth out relevant signals.
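The sliding-window smoothing is just a moving average over each joint coordinate across frames; a sketch (the window size is a judgment call, kept small here precisely to avoid flattening fast swings):

```python
import numpy as np

def smooth_poses(poses, window=3):
    # poses: (frames, joints, 3) array of 3D keypoints.
    # Centered moving average over the time axis; edge frames
    # average over whatever neighbors are available.
    poses = np.asarray(poses, dtype=np.float64)
    out = np.empty_like(poses)
    half = window // 2
    for t in range(len(poses)):
        lo, hi = max(0, t - half), min(len(poses), t + half + 1)
        out[t] = poses[lo:hi].mean(axis=0)
    return out

# One joint jittering up and back down across three frames:
noisy = np.array([[[0, 0, 0]], [[3, 3, 3]], [[0, 0, 0]]], dtype=float)
print(smooth_poses(noisy)[1])  # middle frame pulled toward its neighbors
```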
Since many of the challenges here are essentially duplicated from those covered in the 2D HPE video-processing stage, I’ll cut the recounting short.
Still, we finally have 3D HPE keypoint coordinates! Now we need to convert these coordinates into usable inputs for action classification, so that we can start to classify a given video of a stroke (forehand, serve, volley).
First, though, a quick tangent:
A Diversion: Exploring 6D Pose Estimation
Although I didn’t end up using it, I considered replacing elements of human pose estimation with object pose estimation (tennis rackets, in this case) in an effort to simplify the overall process. Object pose estimation in this context would imply estimating the position and orientation of someone’s tennis racket.
The “6D” aspect of “6D pose estimation” refers to the idea of estimating both an object’s position and orientation in a 3D context. I was specifically intrigued by a paper I saw, “Augmented Autoencoders: Implicit 3D Orientation Learning for 6D Object Detection,” for two reasons:
- It described a process to generate synthetic training data via simulating a given 3D object model (e.g. a CAD model) in a large variety of noisy images in different orientations and positions. Not having to worry about acquiring and labeling a dataset was a big potential win.
- I’d used Denoising Autoencoders before to impute missing values (not a great reason).
Accuracy was ultimately too poor (not surprising, given the synthetic data generation). Here's a bit of what the training process looked like:
3D CAD model of a tennis racket to use for generating a training dataset
Synthetic data generated for training, left - 3D model in noisy image, right - ground-truth position & orientation
Ok, let’s get back to turning our 3D HPE outputs into something useful.
Skeleton Posture-Motion Feature Action Maps
Right, so we want to normalize our 3D HPE sequences a bit, so that our action classification framework / feature extraction doesn't need to deal with varying sequence lengths (sequences spanning different numbers of frames).
We could try padding/compressing frame-sequences in order to normalize input lengths, but there’s a big risk of introducing some form of distortion.
In the process of searching for a solution, I stumbled onto this paper: Spatio–Temporal Image Representation of 3D Skeletal Movements for View-Invariant Action Recognition with Deep Convolutional Neural Networks, which provided a pretty neat approach: encoding a sequence of 3D poses into a single RGB image (a Skeleton Posture-Motion Feature action map; SPMF) capturing both temporal and spatial dimensions. Incidentally, this also reduced the dependence on my custom smoothing procedure, as the proposed encoding scheme applies its own smoothing function to the predictions.
Generation of an SPMF, Source
An example SPMF
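To give a flavor of the encoding (a simplification of my own, not the paper's exact SPMF construction): one column per frame, one row per joint, with normalized x/y/z coordinates mapped to the R/G/B channels.

```python
import numpy as np

def pose_sequence_to_image(poses):
    # poses: (frames, joints, 3) 3D keypoints.
    # Produces a (joints, frames, 3) uint8 image: column = frame,
    # row = joint, channels = normalized x/y/z coordinates.
    poses = np.asarray(poses, dtype=np.float64)
    lo, hi = poses.min(), poses.max()
    norm = (poses - lo) / (hi - lo + 1e-8)  # scale into [0, 1]
    return (norm.transpose(1, 0, 2) * 255).astype(np.uint8)

seq = np.random.rand(60, 16, 3)   # 60 frames, 16 joints
img = pose_sequence_to_image(seq)
print(img.shape)  # (16, 60, 3): a fixed-size image per sequence
```

The key property is that any sequence length collapses into a single fixed-interpretation image, which is what makes downstream CNN classification straightforward.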
Alright, now let’s explore trying to get some useful data out of this magic image.
Action Classification and Recognition
My plan to build out action classification was to use Deep Convolutional Neural Networks based on the DenseNet architecture, essentially the approach taken by the earlier SPMF paper. From there, I wanted to apply some form of feature extraction procedure to extract specific stroke-level features, such as: point-of-contact, follow-through, and spin. With these features I hoped it would be possible to give some form of useful feedback.
However, there was a big difference in this phase compared with all earlier stages: no datasets! There were no annotated, publicly available tennis video datasets with stroke-level features (point-of-contact, follow-through, etc.). Moreover, even datasets that did include relevant videos, such as Kinetics-700, only had a limited number of them (1,144 clips labeled "playing tennis" in Kinetics-700).
I also checked out some other datasets:
- Olympic Sports Dataset
- Only Tennis Serves
- UCF101 - Action Recognition Data Set
- Tennis swings
- HACS: Human Action Clips and Segments Dataset
- No luck
- The YouTube Sports-1M Dataset
- Has tennis label
- Sports Videos in the Wild (SVW): A Video Dataset for Sports Analysis
- Has tennis label
So, now I was in a situation where accuracy for non-trivial features was essentially terrible.
Of course, I had known that accessible data would grow sparser as the relevant domain narrowed, but I massively underestimated the investment implied by acquiring and labelling a useful training dataset.
Trivial features which were pretty easy to extract:
- Right-side
- Left-side
- Two-Hand
- One-Hand
- Follow-Through over Shoulder
Non-trivial features which I struggled to achieve reasonable accuracy with:
- Topspin
- Hitting up on the ball
- Volley
- Swinging Volley
Unpleasant Decisions
At this point, I had to make some decisions. Accuracy and feature extraction were proving to be really difficult tasks without access to deep pre-existing datasets.
I had to decide between:
- Dedicating a big chunk of time to create a truly useful product and hoping to find product-market fit.
- I couldn’t justify working on this for a year only for myself, so if it kept going I needed to try to make it useful for others.
- Letting this go and leaving it as a fun experiment.
Ultimately, I chose to leave this work behind, for the following reasons:
- I lacked experience coaching others and didn’t have enough exposure to a diverse set of coaches to be confident in my understanding of providing meaningful, generalized tennis feedback.
- Also individual differences in “correct” strokes implied that any feedback would likely need to be comparison-based.
- The data labeling/annotation problem implied a huge upfront cost (at one point I was trading Python lessons for my friends in exchange for data labeling!); I noticed competitors hiring for full-time data labeling roles, suggesting that this wasn’t a solved problem (they’re still hiring for those roles!).
- Lack of conviction in the market opportunity; talking to club players, coaches, and doing general market exploration gave me many reasons to take pause. My impression was also that there was a big long-tail of features needed to achieve perceived parity with existing solutions.
- The higher end of the market is also exposed to more high-margin solutions, e.g. multi-camera, fixed-position HD recording setups with costs running up to 5 figures.
- My cofounder, who I brought on towards the tail-end of this work, also challenged my thinking and forced me to question some of my assumptions. We ended up pivoting to building some other stuff together.
Final Thoughts
Even though my overall approach was crude and wasteful (there are now also mobile APIs for pose estimation), it was a lot of fun to dive into a challenge like this; reading research papers isn't boring, at least when there's something actionable you're looking to extract.
Some thoughts:
- Got invited to some startup accelerators and was offered the opportunity to white-label my framework, but I didn't want to undertake long-term commitments, particularly alone.
- Got to talk to a lot of interesting people in the process of building this, mostly via user interviews, so that was pretty cool.
- Gained a new dislike for big Python projects that use numpy and throw around a bunch of magic numbers in big blocks of uncommented code.
- On the other hand, working out affine transforms on my whiteboard was kind of fun.
- From experience: if you do similar work, don’t carelessly forget to turn off a p2.xlarge instance for an extended period of time.
Lastly, if you want to see what automated feedback with pose estimation can look like when executed well check out SevenSix and Onyx.