George B. Moody
Harvard-MIT Division of Health Sciences and Technology
A common problem in signal analysis is to sort waveforms into groups by morphology (shape). Principal component analysis can provide compact descriptions of shape that are minimally affected by noise. Using the QRS complex of the ECG as an example, this tutorial presents practical methods for principal component analysis of waveforms, including software that can be used as is or customized as desired. These methods can be applied to other waveforms within the ECG as well as those contained in many other signals.
For additional discussion of the methods
described here, see
GB Moody, RG Mark. QRS morphology representation and noise estimation using the Karhunen-Loève transform. Computers in Cardiology 16:269-272; 1989. Please cite the above publication when referencing this material, and also include the standard citation for PhysioNet: Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and Physionet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220 [Circulation Electronic Pages; http://circ.ahajournals.org/content/101/23/e215.full]; 2000 (June 13). |
Introduction
The goal of this exercise is to produce a program which, given a WFDB record containing an ECG signal and an annotation file containing QRS times, produces as output a list of QRS feature vectors. Many approaches are possible, but this exercise focuses on principal component analysis. In order to complete the exercise, you will need to:
- install necessary software
- decide what the pattern vectors will be (define the pattern space)
- select a training set of data and determine the distribution of pattern vectors within the pattern space
- derive the covariance matrix of the distribution and its eigenvectors (the principal component basis functions)
- decide how many principal components should be used in each feature vector (define the feature space)
- implement a method for deriving feature vectors, and observe how the feature vectors are distributed throughout the feature space.
This directory contains sources for several programs that can be used to work through the exercise:
- defs.h
- definitions for data structures shared by the programs below
- makepv.c
- reads an annotated WFDB record and writes pattern vectors to the standard output
- makecv.c
- reads a file of pattern vectors and writes the covariance matrix to the standard output
- makeev.c
- reads a covariance matrix from the standard input and writes a sorted list of eigenvectors to the standard output
- makefv.c
- reads a list of pattern vectors and a list of eigenvectors, and writes a list of feature vectors to the standard output
There is also a Makefile that can be used to automate the process of compiling these programs (to use it, you will need to have some version of the make utility available under all versions of Unix and Linux, and also included in the free Cygwin development environment for MS-Windows).
Preparation
Download and install the WFDB Software Package if you have not already done so.
Make a working directory for this exercise, and save copies of defs.h, makepv.c, makecv.c, makeev.c, makefv.c, and Makefile in it.
Define the pattern space
The pattern space is the space in which your pattern vectors are embedded. Pattern vectors are the initial numerical descriptions of the items (QRS waveforms, in this exercise) to be analyzed. Depending on the application, the components of pattern vectors might be raw samples, or they might be filtered, baseline-corrected, or otherwise preprocessed. In other applications, a pattern vector might be a collection of any measurements of the items to be analyzed.
Begin the exercise by choosing the number of elements in your pattern vectors, and insert this number in defs.h (find and change the definition of PVDIM). Don't worry about setting the number of elements in the feature vectors yet. How do we choose a value for PVDIM? Usually, we want this number to match the number of samples within a typical waveform of the type we wish to characterize. We'll be using waveforms from the MIT-BIH Arrhythmia Database for this exercise, and these are sampled at 360 samples per second per signal. A typical QRS complex lasts 80 milliseconds, or about 30 samples at 360 samples per second, so you might set PVDIM to 30. Of course, you don't need to use every sample from the QRS complex in your pattern vectors -- you might have a better idea.
Edit makepv.c so that it derives the pattern vectors you wish to use, and compile it. (Since makepv reads PhysioBank data, it must be compiled and linked with the WFDB library, included with the WFDB Software Package. If you don't know how to do this, see the Makefile for this tutorial, or see the WFDB Programmer's Guide.) Note that makepv prints `tagged' pattern vectors, i.e., the annotation times and types are copied from input to output. In most practical applications, annotation types would not be available, so do not write your pattern-vector derivation code to make use of that information. You may wish to use it as a debugging aid.
Get training set pattern vectors
Run makepv on your training set (see the Makefile for an example of how to do this). Your training set should include at least several hundred beats, preferably much more. Your training set might be the first several hundred beats from the record you wish to analyze; choose one from the MIT-BIH Arrhythmia Database for your first experiment. You may use beats from several different records if you wish to experiment with generating a `universal' set of basis functions. For input annotation files, use either reference (atr) files, or generate annotation files using sqrs or your own QRS detector if you have one. Programs rdann and wrann may be useful in selecting sets of annotations to use for training. (The WFDB Software Package includes sqrs, rdann, and wrann.)
Derive the principal component basis functions
Compile makecv.c (once again, use or refer to the Makefile for guidance) and run the resulting program on the output of makepv. (If it's not obvious how to do so, read this.) You may wish to examine the structure of the covariance matrix before continuing. If the structure of the matrix reveals strongly dependent components, you might consider simplifying your pattern vector in the interest of reducing the computational burden.
Compile makeev.c and run it on the output of makecv. The output of makeev contains the calculated eigenvectors (the principal component basis functions) and the corresponding eigenvalues, which can be used to assess the relative importance of the eigenvectors. Save this output in a file for use later on.
Define the feature space
From the output of makeev, plot the eigenvalues vs. the eigenvector numbers and try to determine a suitable number of feature vector components from this plot. (If you are lucky, you may see a distinct `knee' in the plot, separating the essential from the relatively insignificant eigenvectors.) If your pattern vectors are composed of time-ordered samples, it may also be helpful to plot the basis functions (i.e., the components of each eigenvector); the first few should be recognizably similar to typical pattern vectors, and the last few may look like noise. Edit defs.h again to insert the number of feature vector components you have chosen (find and change the definition of FVDIM).
Derive the feature vectors
Compile makefv.c and run it on your test set. When you run makefv, supply the name of the file containing basis functions (the output of makeev) as a command-line argument to makefv (see the Makefile for an example).
There are several useful ways to look at the output of makefv (the feature vectors). A scatter plot using the first two principal components as the axes is a good start. Make separate plots for normal and ectopic beats and observe how they cluster. You should also study the size of the residual errors: are they larger for ectopic beats? for noisy beats? What happens if the fiducial point is not correctly placed (i.e., if the time given in the annotation file is incorrect by one or several samples)?