Creating PhysioBank (WFDB-compatible) Records and Data Collections

If you have digital recordings of signals or time series, perhaps with annotations, that you would like to study using PhysioToolkit software such as that in the WFDB software package, or that you would like to contribute to PhysioBank, the information on this page should get you started on creating PhysioBank-compatible records from your data.

Many data formats are WFDB-compatible; there is no single "WFDB format". This tutorial will help you determine whether your data are already in a WFDB-compatible format and, if they are not, choose a suitable one.

If you haven't done so already, install the WFDB software package before continuing.

Also available is the WFDB Matlab Toolbox, a Matlab implementation of the WFDB software package.

Terminology

The basic component of a PhysioBank data set is a record, which consists of data that describe a single subject, simulation, or experimental run. Typically, a record contains one or more signals and one or more sets of annotations, together with header information (described below).

In this context, a signal is a time series of measured or calculated samples separated by uniform time intervals (sampling intervals). In PhysioBank-compatible records, samples are represented as 8-, 10-, 12-, 16-, 24-, or 32-bit integers.* The accompanying header information provides, among much else, the parameters needed to convert the dimensionless integer samples into calibrated physical quantities (such as blood pressures in mmHg).

Support for 24- and 32-bit integer samples was introduced in WFDB version 10.5.0 (March 2010). Previous versions were limited to resolutions of 16 bits or fewer.
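
The conversion from integer samples to physical quantities can be sketched as follows. This is a minimal illustration, assuming the convention documented in header(5), in which the gain is expressed in ADC units per physical unit and the baseline is the sample value that corresponds to zero physical units. The function name and the numeric values are hypothetical.

```python
def to_physical(samples, gain, baseline=0):
    """Convert integer ADC samples to physical units.

    Assumes gain is in ADC units per physical unit and baseline is the
    integer sample value that corresponds to 0 physical units.
    """
    if gain == 0:
        raise ValueError("gain must be nonzero")
    return [(s - baseline) / gain for s in samples]


# Hypothetical values: a gain of 200 ADC units/mV and a baseline of 1024.
digital = [1024, 1224, 924]
physical_mv = to_physical(digital, gain=200.0, baseline=1024)
print(physical_mv)  # [0.0, 1.0, -0.5]
```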

The sampling frequency of a signal is the number of sampling intervals per second (which may be less than one for infrequently-sampled signals). In most cases, all signals belonging to a record are sampled at the same sampling frequency. If this is not true, then a frame interval must be defined, generally as the least common multiple of the various sampling intervals used in a record, and the frame frequency is the number of frame intervals per second.
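
As a rough illustration of the frame interval, the sketch below computes the least common multiple of several sampling intervals using exact fractions. The function name and the example frequencies are assumptions made for this illustration; they are not part of the WFDB software.

```python
from fractions import Fraction
from math import gcd


def frame_interval(sampling_frequencies):
    """Return the frame interval in seconds as an exact Fraction.

    The frame interval is taken to be the least common multiple (LCM)
    of the sampling intervals 1/f of the individual signals.
    """
    if not sampling_frequencies:
        raise ValueError("at least one sampling frequency is required")
    numerator_lcm = 1
    denominator_gcd = 0
    for f in sampling_frequencies:
        interval = 1 / Fraction(f)  # sampling interval, in lowest terms
        # For fractions in lowest terms, the LCM is
        # lcm(numerators) / gcd(denominators).
        numerator_lcm = (numerator_lcm * interval.numerator
                         // gcd(numerator_lcm, interval.numerator))
        denominator_gcd = gcd(denominator_gcd, interval.denominator)
    return Fraction(numerator_lcm, denominator_gcd)


# Hypothetical record: one signal sampled at 500 Hz, another at 125 Hz.
frequencies = [500, 125]
interval = frame_interval(frequencies)
frame_frequency = 1 / interval
samples_per_frame = [int(f * interval) for f in frequencies]
print(interval, frame_frequency, samples_per_frame)  # 1/125 125 [4, 1]
```

In this example each frame holds four samples of the faster signal and one of the slower signal.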

Also in this context, an annotation is a label that "points" to a specific sampling interval (or frame interval) in the record (and optionally to a specific signal as well). Each annotation can have a small number of numeric attributes, as well as either a string or a URL, associated with it. Annotations are commonly used in PhysioBank data collections to label heart beats, and to record observations and events that occur at non-uniform or infrequent intervals.

Sets of annotations are stored in annotation files; for example, an annotation file may contain a set of heart beat annotations, or a set of sleep stage annotations, or both. A record may have any number of associated annotation files, distinguished by suffix. The suffix of the name of an annotation file is the annotator name (or simply, the annotator); usually the annotator name is either the name of the program that created the annotations, or a description of the type of annotations contained in the file. The annotator name atr is conventionally used for reference annotations that have been manually reviewed and corrected.

A complete set of records within PhysioBank is a data collection. Data collections typically represent the data gathered as part of a single research project or study, and it is usually true that all records in a data collection share basic characteristics such as types of signals and annotations. If you plan to contribute a data collection to PhysioBank, the final section of this tutorial describes how to organize a set of records into a collection.

Data collections vs. databases

Most of PhysioBank's existing data collections have names that include the word database. This generic term has come to be widely misused and misunderstood as synonymous with the specific term relational database. PhysioBank data collections are (non-relational) databases, composed of multiple text and binary files, and meant to be read using a wide variety of software (but not relational database management software). Since the term database is likely to cause confusion in this tutorial, we use the term data collection below.

Are your data already in PhysioBank-compatible format?

Many medical device manufacturers have either adopted PhysioBank-compatible formats natively, or provide a means of exporting their proprietary data into a PhysioBank-compatible format. Sometimes, for historical reasons, the term "MIT format" is used to describe a PhysioBank-compatible format. European Data Format (EDF), used widely to store unannotated data, differs from the formats most often used in PhysioBank in that it does not make use of external header files, but it is fully PhysioBank-compatible. The newer EDF+, which is a variant of EDF that incorporates a limited capability for storing annotations, is mostly PhysioBank-compatible (but see About EDF+ and BDF+ annotations below). BDF and BDF+ (variants of EDF and EDF+ for 24-bit data) are also PhysioBank-compatible.

If you have records that include .hea or .edf files, verify that they are PhysioBank-compatible by trying to read them with wfdbdesc, rdsamp, and (if you have annotation files) rdann.

Most files in common binary formats that use fixed-length samples for storing digitized signals (including many that, like EDF and EDF+, contain embedded metadata at the beginning of the file) are also PhysioBank-compatible signal files. If your data are already in such a format, it may be sufficient to create a header file and, if applicable, an annotation file for each record. See signal(5) for details of supported signal file formats, and header(5) for complete specifications of header file format, with examples.

What's in a record?

Unlike records in relational databases, each PhysioBank-compatible record is stored in its own files. The files belonging to any given record share a record name (usually the initial part of the file names), and are distinguished by suffixes. Record names (needed by WFDB applications to specify their inputs) never include the .hea suffix, but they do include .edf (or other suffixes, if used) when reading EDF (EDF+, BDF, BDF+) files.

For example, record 100 of the MIT-BIH Arrhythmia Database consists of three files, named 100.hea, 100.dat, and 100.atr.

WFDB-compatible records generally contain three types of files:

Header file: This is the only (normally) required element of a record; it consists of a short text file, named with the suffix .hea. A record's header file specifies the duration of the record, and (optionally) its starting time; it also contains, for each signal, its name, storage format, sampling frequency, calibration parameters, and the name of the signal file in which it is stored. Additional information, which often includes the age, gender, diagnoses, and medications taken by the subject, can be included in the header file if available. As noted above, header files are not necessary for EDF, EDF+, BDF, and BDF+ files.
Signal file(s): These binary files generally contain samples only; conventionally, they have names ending in .dat, but this is not required. Records can include a signal file for each available signal, but usually all of the available signals are stored in a single signal file, in which frames containing samples taken from each signal simultaneously are always arranged in the same order and written in sequence. These characteristics permit efficient random access within signal files, since the position of a sample at any given time can be readily calculated. Records consisting entirely of non-periodic observations may lack signal files.
Annotation file(s): These binary files contain annotations in a highly compact format that requires slightly more than 16 bits per beat label annotation (more for annotations containing strings or URLs). Many records are multiply annotated, either by different observers or with respect to different attributes, and in such cases, each set of annotations is generally stored in its own file. Files with names ending in .atr are conventionally used to store reference annotations that have been manually checked for accuracy. Since annotations are often created long after the respective signal files, having external annotation files permits them to be added to existing records without a need to replace the usually lengthy signal files. Unannotated records don't include annotation files.
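
The random-access property of signal files described above can be illustrated with a short sketch. It assumes the 16-bit little-endian format described in signal(5) as format 16, a single signal file holding one sample per signal per frame, and no extra bytes preceding the samples. The function names and sample values are hypothetical.

```python
import struct

BYTES_PER_SAMPLE = 2  # assumption: 16-bit (format 16) samples


def sample_offset(frame_index, signal_index, n_signals):
    """Byte offset of one sample in a multiplexed signal file."""
    return (frame_index * n_signals + signal_index) * BYTES_PER_SAMPLE


def read_sample(data, frame_index, signal_index, n_signals):
    """Read one little-endian 16-bit sample directly, without scanning."""
    offset = sample_offset(frame_index, signal_index, n_signals)
    (value,) = struct.unpack_from("<h", data, offset)
    return value


# Hypothetical two-signal record with three frames:
# (10, 20), (11, 21), (12, 22).
demo = struct.pack("<6h", 10, 20, 11, 21, 12, 22)
print(sample_offset(2, 1, 2))      # 10
print(read_sample(demo, 2, 1, 2))  # 22
```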

PhysioBank-compatible file names

Files hosted on PhysioNet may have names containing any of these 66 printing characters:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+_.-

Avoid using spaces or any other characters in the names of PhysioBank-compatible files that you create, since files named with disallowed characters may be unreadable by PhysioToolkit software. Filenames should not begin with '-', and record names, except for EDF/EDF+/BDF/BDF+ records, should not contain '.' characters. The suffixes hea, dat, edf, and bdf should not be used as annotator names. By convention, the names of most files hosted on PhysioNet do not include upper-case characters.
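
The naming rules above can be checked mechanically. The following is a minimal sketch; the function names are hypothetical, and it encodes only the rules stated in this section.

```python
import re

# The 66 permitted characters: letters, digits, and + _ . -
ALLOWED_NAME = re.compile(r"[A-Za-z0-9+_.-]+")
RESERVED_ANNOTATORS = {"hea", "dat", "edf", "bdf"}


def is_valid_file_name(name):
    """Only permitted characters, and no leading '-'."""
    return bool(ALLOWED_NAME.fullmatch(name)) and not name.startswith("-")


def is_valid_record_name(name, edf_like=False):
    """Record names may contain '.' only for EDF/EDF+/BDF/BDF+ records."""
    if not is_valid_file_name(name):
        return False
    return edf_like or "." not in name


def is_valid_annotator(name):
    """Annotator names must not be one of the reserved suffixes."""
    return is_valid_file_name(name) and name not in RESERVED_ANNOTATORS


print(is_valid_file_name("100.hea"))      # True
print(is_valid_file_name("my file.dat"))  # False
print(is_valid_record_name("rec.1"))      # False
print(is_valid_annotator("dat"))          # False
```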

Creating signal and header files

If you don't already have PhysioBank-compatible records, an easy way to make them from the data you have is to begin by creating a CSV file containing one sample of each signal per line. For a record consisting of two ECG signals, for example, each line contains two comma-separated sample values.

If you have written your data in this format to a CSV file named foo.csv, create foo.hea and foo.dat using this command:

wrsamp -F freq -i foo.csv -o foo -s, 0 1

replacing freq by the sampling frequency of your signals. The final command line arguments (0 and 1 in the example) specify the columns of the input file that should be written as signals to the output; column 0 is the leftmost, 1 the next, etc. Columns can be omitted, reordered, or duplicated as desired. See wrsamp for details and additional options that can be used if your samples are not 16-bit integers.
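
If your samples are held in a program rather than a file, the CSV input can be generated with a few lines of code. The sketch below writes one frame per line and builds the wrsamp argument list shown above without running it. The sample values are made-up placeholders rather than real ECG data, the sampling frequency of 360 is an assumption, and the function names are hypothetical.

```python
import csv
import os
import tempfile


def write_samples_csv(path, frames):
    """Write one line per frame, one integer sample per signal."""
    with open(path, "w", newline="") as f:
        csv.writer(f, lineterminator="\n").writerows(frames)


def wrsamp_command(freq, csv_name, record, columns):
    """Build (but do not run) the wrsamp command line from this tutorial."""
    return (["wrsamp", "-F", str(freq), "-i", csv_name, "-o", record, "-s,"]
            + [str(c) for c in columns])


# Placeholder samples for two signals (not real data).
frames = [(995, 1011), (997, 1009), (1002, 1004)]
with tempfile.TemporaryDirectory() as workdir:
    csv_path = os.path.join(workdir, "foo.csv")
    write_samples_csv(csv_path, frames)
    with open(csv_path) as f:
        print(f.read(), end="")

print(" ".join(wrsamp_command(360, "foo.csv", "foo", [0, 1])))
```

The resulting argument list could be passed to subprocess.run in the directory containing foo.csv, provided the WFDB software package is installed.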

Using any text editor, edit the .hea file to insert signal names, physical units, and calibration parameters. For records to be contributed to PhysioBank, please add, at the end of the file, an info string (a comment line beginning with '#') that describes (at a minimum) the age, gender, diagnoses, and medications of the subject (other information that does not identify the subject is also welcome). Example:

# <age>: 35  <sex>: M  <diagnoses>: (none)  <medications>: (none)

Please use this format to permit indexing software to parse this information reliably. This string may extend over multiple lines if necessary, but begin each such line with '#'.
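
When many records need info strings, generating them programmatically helps keep the format consistent. The sketch below reproduces the layout of the example above. The function name is hypothetical, and joining multiple diagnoses or medications with commas is an assumption, since this tutorial does not specify a separator.

```python
def info_line(age, sex, diagnoses=(), medications=()):
    """Format an info string in the layout used by the example above."""
    def listing(items):
        # Assumption: multiple items are separated by ", ".
        return ", ".join(items) if items else "(none)"

    return "# <age>: {}  <sex>: {}  <diagnoses>: {}  <medications>: {}".format(
        age, sex, listing(diagnoses), listing(medications))


print(info_line(35, "M"))
# prints: # <age>: 35  <sex>: M  <diagnoses>: (none)  <medications>: (none)
```

The returned line can then be appended to the end of the record's .hea file.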

Creating Signal Files from Physical Signals

*For information about physical vs digital signals and file storage, see the FAQ's digital or physical section.

The WFDB software package's wrsamp is designed to write digital values into a signal file. All non-integer input values are rounded, so if you input a physical signal whose values all have magnitudes below 0.5, the output will consist entirely of zeros. This is fine if you already have the digital values in text format, but troublesome if you have only physical (analogue) values.

One feature that may help in both instances is the -x option of wrsamp, which multiplies each input channel by a specified factor before the samples are written to the signal file. Do not confuse this with the -G option, which affects only the header file and therefore only how the signal is interpreted after it has been written. See the wrsamp man page (man wrsamp) for more details.
If you have Matlab, you can use the mat2wfdb function from the WFDB Matlab Toolbox, which automatically chooses and applies appropriate gains and offsets to the input Matlab signals before writing the output WFDB file.
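
The effect of scaling before rounding can be seen in a small sketch. It only mimics the idea of multiplying physical values by a factor before converting them to integers; it does not call wrsamp, and Python's rounding rule for exact ties may differ from wrsamp's. The function name, the factor of 200, and the sample values are assumptions for illustration.

```python
def to_adc_units(physical_values, scale):
    """Scale physical values, then round them to integers."""
    return [round(v * scale) for v in physical_values]


# Hypothetical physical samples in mV, all with magnitude below 0.5.
physical_mv = [0.12, -0.3, 0.45]

print(to_adc_units(physical_mv, 1))    # [0, 0, 0]
print(to_adc_units(physical_mv, 200))  # [24, -60, 90]

# If the header records the same factor as the gain (200 ADC units
# per mV), dividing by it recovers the physical values.
recovered = [d / 200 for d in to_adc_units(physical_mv, 200)]
print(recovered)
```

Without scaling, every sample collapses to zero; with a suitable factor, and a matching gain recorded in the header, the physical values remain recoverable.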

Creating annotation files

If your records include beat labels or other non-periodic observations, they can be stored in annotation files. The easiest way to do this is to put your non-periodic information into the text format produced by rdann; text in this format can be converted into PhysioBank-compatible annotation files using wrann.

If you wish to create annotations for your records, PhysioToolkit offers a variety of software that may be useful, including QRS and pulse wave detectors for locating heart beats in ECG and blood pressure signals automatically, and WAVE and LightWAVE, applications for viewing PhysioBank-compatible data and for interactive creation and editing of annotations.

PhysioBank-compatible annotation files have two-part names of the form record.annotator, where record is the name of the record with which they are associated, and annotator is a name that typically either describes the type of annotations contained in the file (e.g., qrs), or identifies the algorithm (e.g., wabp) or the person who created the annotations. As noted above, don't use hea, dat, edf, or bdf as an annotator name, or any other suffix that is likely to be misinterpreted. Otherwise, the only restrictions on annotator names are those that apply to the names of all PhysioNet-hosted files.

About EDF+ and BDF+ annotations

EDF+ and BDF+ closely resemble the older EDF and BDF respectively; the major innovation of the newer formats is a specification for embedding annotations within EDF+ and BDF+ files. These annotations are encoded as a signal named "EDF annotations"; since each annotation requires many samples of the annotation "signal", there is a minimum time interval between consecutive annotations. The specification allows only one set of annotations per record.

EDF+ and BDF+ files are, as noted above, mostly compatible with WFDB applications. Current versions of the WFDB library do not read EDF+ annotations directly, however; it is necessary to extract them from the EDF+ file and rewrite them into a conventional PhysioBank-compatible annotation file in order to read them with WFDB applications. This can be done easily using rdedfann(1) and wrann(1).

Creating a data collection from a set of records

Creating a PhysioBank-style data collection (repository) from a set of PhysioBank-compatible records is a simple process:

Choose a unique descriptive name for your data collection (look at the PhysioBank Archive Index for inspiration, and to avoid duplication). Also choose a short (5-10 character) abbreviation that can be used as a directory name, and avoid the existing short names as shown in parentheses in LightWAVE's and the PhysioBank ATM's database menus. These menus are constructed from the contents of PhysioBank's master DBS file.
Make a data directory, giving it the short name from the previous step, and move or copy into it all of the records (header, signal, and annotation files) from your collection.
Within the data directory, make a plain text file called RECORDS containing a list of the names of the records, one per line. (Example: RECORDS.) There should be nothing else in the file (no comments, notices, etc.) other than the record names. (Remember that, with the exception of EDF/EDF+/BDF/BDF+ records, record names are not file names; see What's in a record? above.)
Within the data directory, make a plain text file called ANNOTATORS containing a list of annotator names. (Example: ANNOTATORS.) Each line should contain an annotator name, followed by a tab character, followed by a brief description (70 characters or fewer, which may include spaces). This file should be empty if there are no annotation files in your collection.
Within the data directory, make a plain text file called README containing a brief description of your data collection. This file should also include acknowledgments of contributors and funding sources as appropriate, and citations of any publications that should be cited in work that makes use of the data.
Within the parent of the data directory, make a plain text file called DBS containing the name of the data directory, followed by a tab character, followed by the unique descriptive name for your collection. The short name should not include any of the parent directories, and should not begin or end with directory separator characters.

If you have more than one collection, make separate data directories, RECORDS, ANNOTATORS, and README files for each collection, but make a single DBS file with descriptions of each collection, one per line.

Since the RECORDS, ANNOTATORS, and DBS files will be used to generate menus for LightWAVE and the PhysioBank ATM, we recommend keeping their contents in alphabetical or numerical order for ease of use.
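
The index files described in the steps above can be generated with a short script. The following is a minimal sketch. The function names, the short name mydb, and the descriptions are hypothetical, and the script does not copy the record files themselves into the data directory.

```python
import os
import tempfile


def write_text(path, lines):
    """Write lines to a plain text file, one per line."""
    with open(path, "w", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def create_collection(parent_dir, short_name, description,
                      record_names, annotators, readme_text):
    """Create the data directory and its RECORDS, ANNOTATORS, README, DBS."""
    data_dir = os.path.join(parent_dir, short_name)
    os.makedirs(data_dir, exist_ok=True)
    # Record names only, sorted, with nothing else in the file.
    write_text(os.path.join(data_dir, "RECORDS"), sorted(record_names))
    # Annotator name, a tab, then a brief description (empty if none).
    write_text(os.path.join(data_dir, "ANNOTATORS"),
               ["{}\t{}".format(name, text)
                for name, text in sorted(annotators.items())])
    write_text(os.path.join(data_dir, "README"), [readme_text])
    # DBS lives in the parent directory; appending allows one line
    # per collection when this function is called more than once.
    with open(os.path.join(parent_dir, "DBS"), "a", newline="\n") as f:
        f.write("{}\t{}\n".format(short_name, description))
    return data_dir


with tempfile.TemporaryDirectory() as parent:
    create_collection(parent, "mydb", "My Example Data Collection",
                      ["101", "100"],
                      {"atr": "reference beat annotations"},
                      "Brief description, acknowledgments, and citations.")
    with open(os.path.join(parent, "mydb", "RECORDS")) as f:
        print(f.read(), end="")
```

Because lines are appended to DBS in call order, creating the collections in alphabetical order keeps that file sorted as well.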

Finally, in preparation for sharing or backing up your data collection(s), make a portable data repository (a zip archive or tarball containing your DBS file and your data directory or directories). If you wish to contribute your data to PhysioNet, such a repository can be uploaded to PhysioNetWorks, unpacked, and accessed directly using LightWAVE.
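
Packaging can be scripted as well. This minimal sketch builds a zip archive containing the DBS file and one or more data directories, using paths relative to their common parent. The function name and the directory layout in the demonstration are hypothetical.

```python
import os
import tempfile
import zipfile


def make_repository_zip(parent_dir, data_dirs, zip_path):
    """Archive DBS and the given data directories, relative to parent_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(os.path.join(parent_dir, "DBS"), "DBS")
        for name in data_dirs:
            for root, _dirs, files in os.walk(os.path.join(parent_dir, name)):
                for file_name in sorted(files):
                    full_path = os.path.join(root, file_name)
                    zf.write(full_path, os.path.relpath(full_path, parent_dir))


# Demonstration with a throwaway layout: DBS plus mydb/RECORDS.
with tempfile.TemporaryDirectory() as parent, \
        tempfile.TemporaryDirectory() as out_dir:
    os.makedirs(os.path.join(parent, "mydb"))
    with open(os.path.join(parent, "DBS"), "w") as f:
        f.write("mydb\tMy Example Data Collection\n")
    with open(os.path.join(parent, "mydb", "RECORDS"), "w") as f:
        f.write("100\n")
    archive = os.path.join(out_dir, "mydb.zip")
    make_repository_zip(parent, ["mydb"], archive)
    with zipfile.ZipFile(archive) as zf:
        print(sorted(zf.namelist()))  # ['DBS', 'mydb/RECORDS']
```

Writing the archive outside the data directories avoids including the archive in itself.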