Finding records in PhysioBank using standard command-line tools

This is not the user-friendly search tool that you are looking for! We are developing such a tool, which will encapsulate the procedures described below within a web-accessible GUI, and we will post a pointer to it here as soon as it's ready to be used.

If you don't mind a bit of typing, however, this page describes how to use grep, cut, and uniq to find records in PhysioBank that have desired features (such as specific combinations of signals). If you are familiar (or become familiar) with other standard command-line tools for text manipulation such as sort and join, you will be able to do much more.

The necessary command-line tools are standard components of Linux, Mac OS X, and all other Unix and Unix-like platforms; Windows users can get them by installing Cygwin.

The PhysioBank Index

All records in PhysioBank that can be viewed by the PhysioBank ATM (nearly 30,000 as of June 2011, from over 50 collections) are indexed in this text file:

physiobank-index (392M, last updated Wednesday, 17 April 2019)

Each line of the PhysioBank Index describes one signal, annotation file, or other feature of a single record; there are about 420,000 lines in the Index. All lines pertaining to any given record are consecutive, and the records appear in dictionary order. The first few lines of the Index are:

aftdb/learning-set/n01	ECG1	ECG	128	142.045 adu/mV	60
aftdb/learning-set/n01	ECG2	ECG	128	143.062 adu/mV	60
aftdb/learning-set/n01	AnnM1	qrs	128	76	60	0-60
aftdb/learning-set/n01	AnnR2	qrsc	128	76	60	0-60
aftdb/learning-set/n02	ECG1	ECG	128	202.429 adu/mV	60
aftdb/learning-set/n02	ECG2	ECG	128	202.429 adu/mV	60
aftdb/learning-set/n02	AnnM1	qrs	128	73	59	1-60
aftdb/learning-set/n02	AnnR2	qrsc	128	73	59	1-60

The lines above describe records n01 and n02 of the collection named aftdb/learning-set. (The file DBS contains short descriptions of each collection; aftdb is the AF Termination Challenge Database, which contains a learning set and two test sets of records.)
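Because all lines for a record are consecutive, uniq can count them without a prior sort. The following is a minimal sketch rather than a definitive recipe: it rebuilds the eight lines shown above in a temporary file (a stand-in for the full Index, created only so the example runs on its own) and reports how many Index lines each record has.

```shell
# Temporary stand-in for physiobank-index, holding the eight lines shown above.
sample=$(mktemp)
{
  printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
      aftdb/learning-set/n01 ECG1 ECG 128 '142.045 adu/mV' 60 \
      aftdb/learning-set/n01 ECG2 ECG 128 '143.062 adu/mV' 60
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
      aftdb/learning-set/n01 AnnM1 qrs 128 76 60 0-60 \
      aftdb/learning-set/n01 AnnR2 qrsc 128 76 60 0-60
  printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
      aftdb/learning-set/n02 ECG1 ECG 128 '202.429 adu/mV' 60 \
      aftdb/learning-set/n02 ECG2 ECG 128 '202.429 adu/mV' 60
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
      aftdb/learning-set/n02 AnnM1 qrs 128 73 59 1-60 \
      aftdb/learning-set/n02 AnnR2 qrsc 128 73 59 1-60
} > "$sample"

# cut keeps the record name (column 1); uniq -c counts each run of
# identical consecutive names; awk tidies the output to "name<TAB>count".
counts=$(cut -f 1 "$sample" | uniq -c | awk '{ print $2 "\t" $1 }')
printf '%s\n' "$counts"

rm -f "$sample"
```

Run against the real Index, the same pipeline (with physiobank-index in place of the temporary file) would list every record together with its number of Index lines.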

Each line of the Index contains up to seven tab-separated columns that describe a signal, annotation set, or feature associated with the record. For lines describing signal and annotation sets, these columns are (from left to right):

Record name
Class
Signal or annotator name
Sampling frequency (Hz)
Gain (adu per physical unit), or number of annotations
Duration (in seconds)
Time intervals during which samples* or annotations are present (in seconds)

Lines describing features are not present for all records; they are described below.

Class is the category of data: a category of signals (defined in sigclasses), a category of annotations (AnnM for machine-derived annotations, or AnnR for reference annotations), or a category of features associated with the record (AgeSex, Meds [medications], Diag [diagnoses], or Info [other information about the subject or the recording]). A sequence number is affixed to each instance of the class if more than one instance is possible in a single record (e.g., ECG1, ECG2, etc.); this is done even if only a single instance is actually present.

An adu is one analog-to-digital converter unit (the quantization step, which is the smallest measurable difference between samples). A gain of 20 adu/mmHg means that two unscaled samples that differ by 20 units represent a pressure difference of 1 mmHg.

* In most cases, signals are present throughout, and the last column is omitted. The MIMIC II Waveform Database is an exception to this rule.
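To make the column layout and the meaning of the gain concrete, here is a small sketch that takes the first Index line shown earlier, extracts column 5 with cut, and converts a raw sample difference into physical units. The difference of 100 adu is a hypothetical value chosen only for illustration.

```shell
# First line of the Index, as shown earlier (columns are tab-separated).
line=$(printf 'aftdb/learning-set/n01\tECG1\tECG\t128\t142.045 adu/mV\t60')

# Column 5 holds the gain, written as "<number> adu/<unit>".
gain_field=$(printf '%s\n' "$line" | cut -f 5)

# Hypothetical raw difference of 100 adu between two samples; dividing by
# the numeric part of the gain gives the difference in mV.
mv=$(printf '%s\n' "$gain_field" | awk '{ printf "%.3f", 100 / $1 }')

printf 'gain: %s\n' "$gain_field"
printf '100 adu corresponds to about %s mV\n' "$mv"
```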

Typical record feature lines appear below:

iafdb/iaf1_afw	Diag1	Atrial Fibrillation
iafdb/iaf1_afw	Meds1	Atenolol, Monopril
iafdb/iaf1_afw	Info1	 Adenosine injected at 70 sec
iafdb/iaf1_afw	Info2	 Note: signals are uncalibrated
iafdb/iaf1_afw	AgeSex	81	F

As for the signal and annotation set lines, the first two columns are the record name and class (data type). The first four feature lines shown above illustrate diagnoses, medications, and two lines of free-text information; the data appear in the third column. The final feature line contains the age (in years) in the third column, and the sex (M, F, or ?) in the fourth column. If the subject's age is over 89, it is shown as 90 (since ages over 89 are protected health information); if the age was not recorded, it is shown as -1. Ages of infants less than 1 year old may be shown as 0, or as a decimal fraction of a year (e.g., 0.3).
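Feature lines can be filtered by column in the same way as signal lines. The sketch below selects AgeSex lines for subjects aged 65 or over, using awk to compare the third column numerically. Only the iafdb line comes from the excerpt above; the two exampledb lines are hypothetical, added to show that an unrecorded age (-1) and a younger subject are skipped.

```shell
# Temporary sample of AgeSex lines: one real (iafdb), two hypothetical (exampledb).
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    iafdb/iaf1_afw AgeSex 81 F \
    exampledb/r01 AgeSex -1 '?' \
    exampledb/r02 AgeSex 34 M > "$sample"

# Keep lines whose class is exactly AgeSex and whose age is at least 65.
# An unrecorded age is stored as -1, so it fails the test automatically.
older=$(awk -F'\t' '$2 == "AgeSex" && $3 >= 65 { print $1 "\t" $3 "\t" $4 }' "$sample")
printf '%s\n' "$older"

rm -f "$sample"
```

Keep in mind that ages over 89 are all recorded as 90, so a threshold above 89 cannot distinguish among the oldest subjects.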

Using the PhysioBank Index

Begin by downloading the Index from the link above.

Open a terminal emulator window and navigate to the directory in which you saved physiobank-index.

There are five records in PhysioBank that include a left ventricular stroke volume signal, which is labelled SV. Finding them is simple: type

grep SV physiobank-index

and the results appear in your terminal window quickly:

slpdb/slp59	SV	SV	250	7.93846 adu/ml	14400
slpdb/slp60	SV	SV	250	7.90293 adu/ml	21300
slpdb/slp61	SV	SV	250	958.995 adu/cc	22200
slpdb/slp66	SV	SV	250	9.957 adu/ml	13200
slpdb/slp67x	SV	SV	250	5.25615 adu/ml	4620

If you were looking for such recordings, you would now know where to find them by looking at the record names in the left-hand column.
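A plain grep matches the pattern anywhere on the line, including inside record names or free-text fields. If that ever produces unwanted matches, one possible refinement is to test the class column alone with awk. In this sketch the two slpdb lines are copied from the output above, and the exampledb line is hypothetical, invented to show a false match.

```shell
# Temporary sample: two real SV lines plus one hypothetical line whose
# record name happens to contain "SV".
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    slpdb/slp59 SV SV 250 '7.93846 adu/ml' 14400 \
    slpdb/slp60 SV SV 250 '7.90293 adu/ml' 21300 \
    exampledb/SV01 ECG1 ECG 250 '200 adu/mV' 60 > "$sample"

# grep counts every line containing "SV", including the false match.
loose=$(grep -c SV "$sample")

# awk checks only column 2 (the class, with an optional sequence number)
# and prints just the record name.
strict=$(awk -F'\t' '$2 ~ /^SV[0-9]*$/ { print $1 }' "$sample")

printf 'grep matched %s lines; records with an SV signal:\n%s\n' "$loose" "$strict"

rm -f "$sample"
```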

Getting (re)acquainted with the command line

If you've ever used any version of Unix, or even MS-DOS, the examples on this page should not look strange. If they do, consult any introductory book or on-line tutorial about Unix or Linux. Here are a few places to start:

The PhysioNet FAQ includes basic information about standard input and output, I/O redirection, and pipes, powerful and easily-understood concepts that are useful whenever working on the command line.
On-line tutorials, such as Working with Data or Command Line Essentials: Text and Pipeline, provide more examples of the use of the tools shown here.
After nearly 30 years, Kernighan and Pike's The Unix Programming Environment remains the best introduction to this approach of tackling problems using tools that each do one job well, and work well together. Used copies are far less expensive than new.

If we want to find records that have at least 3 ECG signals, we can look for ECG3:

grep ECG3 physiobank-index

This results in a very long list of records that quickly scrolls off the screen. If we want to know how long the list is, we can use wc to count the lines:

grep ECG3 physiobank-index | wc -l

(The pipe symbol, '|', connects a pair of commands; it means "take the standard output of the command on the left and feed it to the standard input of the command on the right".) When this page was written, there were 6519 recordings with at least 3 ECG signals in PhysioBank. We can save the entire list by redirecting the standard output into a file, like this:

grep ECG3 physiobank-index >ECG3-records

The '>' collects the standard output of the command, which would otherwise be shown in the terminal window, and saves it in a file (ECG3-records).

Suppose what we really want are the longest such recordings. Here's how to find the 5 longest cases:

grep ECG3 physiobank-index | cut -f 1,6 | sort -nr -k2 | head -5

(This command uses pipes to chain four commands together, each one reading the output of the previous one; cut selects the first and sixth fields from each line output by grep; sort rearranges the lines in reverse numerical order of the second field output by cut; and head discards all but the first five lines output by sort.) The output lists 5 recordings, each containing over 400 hours of ECG3:

mimic2db/a46013/a46013	1815699
mimic2db/a44012/a44012b	1608251
mimic2db/a40308/a40308	1577319
mimic2db/a44261/a44261c	1531616
mimic2db/a44267/a44267b	1527637
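The durations are in seconds, so checking the "over 400 hours" figure means dividing by 3600. As a quick sketch, the five lines above can be passed through awk to print each duration in hours:

```shell
# The five (record, seconds) pairs printed above, tab-separated.
top5=$(printf '%s\t%s\n' \
    mimic2db/a46013/a46013 1815699 \
    mimic2db/a44012/a44012b 1608251 \
    mimic2db/a40308/a40308 1577319 \
    mimic2db/a44261/a44261c 1531616 \
    mimic2db/a44267/a44267b 1527637)

# Divide column 2 by 3600 to express each duration in hours.
hours=$(printf '%s\n' "$top5" |
    awk -F'\t' '{ printf "%s\t%.1f\n", $1, $2 / 3600 }')
printf '%s\n' "$hours"
```

With the full Index, the same awk stage could be appended to the pipeline after head -5.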

There is a caveat, however: these recordings are all from the MIMIC II database, and the signals are not necessarily continuous; in fact, they may not even be simultaneously available. To find a set of long records with at least 3 continuous, simultaneous ECG signals, we can exclude the MIMIC databases and the similar Challenge 2009 database from the search:

grep ECG3 physiobank-index | \
 grep -E -i -v "mimic|challenge/2009" | \
 cut -f 1,6 | sort -nr -k2 | head -5

(Here the \ characters indicate the command continues on the following line.) The results are:

ltstdb/s30691	85860
ltstdb/s30731	85845
ltstdb/s30801	85821
ltstdb/s30741	85800
ltstdb/s30752	85736

These somewhat contrived examples illustrate the flexibility of using standard command-line tools to search within the PhysioBank Index. If these tools are already familiar, it's easy to perform much more complex searches, including many that would be very difficult to perform using a relational database and SQL.
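As one illustration of a search for a combination of signals, the sketch below lists records that have both an ECG signal and an SV signal. It builds one sorted list of record names per signal class and intersects the two lists with join. All four sample lines are hypothetical (exampledb is not a real collection); they stand in for the full Index so that the example runs on its own.

```shell
# Hypothetical stand-in for physiobank-index: r01 has ECG and SV,
# r02 has only ECG, r03 has only SV.
index=$(mktemp); with_ecg=$(mktemp); with_sv=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    exampledb/r01 ECG1 ECG 250 '200 adu/mV' 3600 \
    exampledb/r01 SV SV 250 '8 adu/ml' 3600 \
    exampledb/r02 ECG1 ECG 250 '200 adu/mV' 3600 \
    exampledb/r03 SV SV 250 '8 adu/ml' 3600 > "$index"

# One sorted, duplicate-free list of record names per class.
# LC_ALL=C makes sort and join agree on the ordering.
awk -F'\t' '$2 ~ /^ECG[0-9]*$/ { print $1 }' "$index" | LC_ALL=C sort -u > "$with_ecg"
awk -F'\t' '$2 ~ /^SV[0-9]*$/ { print $1 }' "$index" | LC_ALL=C sort -u > "$with_sv"

# join prints only the names that appear in both lists.
both=$(LC_ALL=C join "$with_ecg" "$with_sv")
printf '%s\n' "$both"

rm -f "$index" "$with_ecg" "$with_sv"
```

To run this against the real Index, replace the temporary file with physiobank-index and adjust the two class patterns to the signals of interest.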

Updated Monday, 13-Jul-2015 20:52:50 CEST

PhysioNet is supported by the National Institute of General Medical Sciences (NIGMS) and the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number 2R01GM104987-09.