We encourage you to set up a mirror of PhysioNet if you wish. If you do so, please use rsync as described below to minimize the impact on other users.
If you simply wish to retrieve a (possibly large) set of files from PhysioNet without downloading them one at a time, you don't need to set up a mirror of PhysioNet; see the PhysioNet FAQ for instructions on downloading an entire PhysioBank database in one step using rsync, or use GNU wget to download any desired selection of files.
PhysioNet is organized in "volumes" that can be mirrored separately. All PhysioNet mirrors provide the "base volume", which contains all of the software, tutorials, reference manuals, publications, and many of the PhysioBank databases. Additional volumes, which are optional for the mirrors, contain very large data sets that don't fit in the base volume. The base volume and volume 2 will not exceed 25 GB each; volumes 3 and 4, 100 GB each; volumes 5 and 6, 2.5 TB each; and volumes 7 and 8, 10 TB each. More than one volume can occupy a single sufficiently large disk partition.
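The size limits above can be turned into a quick disk-space estimate. The following is a minimal sketch, not part of the mirror kit: the function names are ours, the base volume is treated as volume 1, and 1 TB is counted as 1000 GB.

```shell
#!/bin/sh
# Upper bound, in GB, on the size of one PhysioNet volume, using the limits
# quoted above. Assumptions: volume 1 is the base volume; 1 TB = 1000 GB.
volume_max_gb() {
    case "$1" in
        1|2) echo 25 ;;
        3|4) echo 100 ;;
        5|6) echo 2500 ;;
        7|8) echo 10000 ;;
        *)   echo "unknown volume: $1" >&2; return 1 ;;
    esac
}

# Sum the upper bounds for a list of volume numbers.
total_max_gb() {
    total=0
    for v in "$@"; do
        size=$(volume_max_gb "$v") || return 1
        total=$((total + size))
    done
    echo "$total"
}
```

For example, `total_max_gb 1 2 3` adds the limits for the base volume and volumes 2 and 3. Remember to allow a further 10 GB or so for system software and temporary storage.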
We use and recommend the following configuration for a PhysioNet web server:
- CPU: 200 MHz or faster Intel or AMD x86- or x86_64-compatible, or any other CPU supported by Linux (Alpha, PowerPC, Sparc, etc.)
- RAM: at least 256 MB (512 MB or more recommended)
- Disk: at least 35 GB (10 GB for system software and temporary storage, 25 GB for the PhysioNet base volume, more if any of the optional volumes are to be mirrored).
- Internet connection: 10 megabits/second or faster if possible
- A registered host name dedicated to the PhysioNet mirror (see below)
- Operating system: Linux, Fedora 14 or later recommended
- Additional software:
  - HTTP server: Apache 2.2 or later
  - Mirroring software: rsync 3.0.7 or later
  - C compiler: GCC, version 4.5 or later
  - A "make" utility, such as GNU make, version 3.0 or later
  - Scriptable stream editor: GNU sed, any version
  - A Unix command-line "mail" utility, such as Heirloom mailx, any version
  - Perl interpreter: perl, version 5.0 or later
  - Perl modules CGI.pm and Readonly.pm, from CPAN or your Linux distribution (see below)
  - A "timeout" utility, such as GNU timeout, from GNU Coreutils, any version
  - Image converter: convert, from ImageMagick version 6.6 or later
  - A "which" utility, such as which, any version
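One way to see which of the listed programs are still missing is sketched below. This is an illustration only, not part of the mirror kit: the function names are ours, and it checks only that a command or Perl module can be found, not its version.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print the name of every Perl module in the argument list that perl
# cannot load (also printed if perl itself is missing).
missing_perl_modules() {
    for mod in "$@"; do
        perl -M"$mod" -e 1 >/dev/null 2>&1 || echo "$mod"
    done
}
```

A typical call might be `missing_tools httpd rsync gcc make sed mail perl timeout convert which` followed by `missing_perl_modules CGI Readonly`. The command names here are the usual ones on Fedora and are an assumption; other distributions may name the Apache binary differently. No output means nothing was found missing.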
If you plan to install and run the PhysioBank ATM on your mirror, we recommend a CPU with a clock speed of 1 GHz or faster, 512 MB or more RAM, and at least 10 GB of additional disk space for the ATM cache. Most PCs built since 2004 meet or exceed these recommendations.
All of the software needed, including the operating system, is freely available open-source software. The cost of suitable hardware can be under US$300; of course it is possible to spend considerably more. If you cannot or do not wish to run Apache under Linux, many other configurations are possible, but we will not be able to help you troubleshoot your setup. Other versions of Linux or Unix, including Mac OS X, should be usable without difficulty. Although we do not recommend or support MS Windows, versions of all of the necessary software are freely available for MS Windows as optional components of Cygwin. The remainder of these notes assumes that you are using Fedora 14.
Currently, 100 GB to 3 TB SATA drives are widely available and are usually least expensive per gigabyte. (4 TB drives are beginning to appear at higher cost per gigabyte.) IDE (PATA) drives can also be used in many older PCs. SATA drives, and any drives larger than 127 GB, may require a controller card in PCs made before 2006. Most current PC motherboards include integrated SATA controllers, but many no longer support IDE drives.
Preparing your site
Please let us know if you encounter any difficulties with this procedure!
You will need to have root (administrative) privileges for some of the steps below. Once the first two steps (installing Linux and the additional software) are complete, the remaining steps can be finished in 10 to 15 minutes.
- If you have not already done so, begin by installing Linux. If you plan to run the PhysioBank ATM locally, we recommend creating a separate 10 GB (or larger) disk partition for the ATM's cache. If you do so, mount this partition as /home/ptmp/, and give it the same permissions as /tmp.
- Install the additional software listed above, following the instructions provided with your Linux distribution. If any of the additional software is not included in your Linux distribution, download it (use the links above, or your favorite repositories) and install it.
- Choose (if necessary, create) a user account that will "own" the mirrored PhysioNet files. Don't choose an account with special privileges, such as root. The rest of these instructions assume that this user's name is "pn". Important: pn's home directory must not be /home/physionet or any subdirectory of /home/physionet.
- Create a directory called /home/physionet, owned and writable by "pn" and readable by anyone. If /home's disk partition does not have at least 25 GB free, make a physionet directory on a disk partition that does have at least 25 GB free, and make /home/physionet a symbolic link to that directory.
- Log in as pn (or as whoever will "own" the mirrored files). While within pn's home directory, run the command
    rsync -a physionet.org::mirror-setup mirror-setup
  to verify that you are able to communicate with the PhysioNet master server, and to download a few short files for setting up your mirror, which will go into a subdirectory called mirror-setup within the current directory (e.g., /home/pn/mirror-setup). If mirror-setup doesn't exist already, rsync will create it.
- Enter the mirror-setup directory by typing
    cd mirror-setup
  If you wish, read through the various files you have downloaded to see how they work, and then run:
    ./configure
  The first time you run it, configure will ask a few questions, but it will remember your answers (in mirror.conf) and will not ask them again if you rerun it.
- [Optional] Your daily mirror updates will begin at a time when the load on the master PhysioNet server is expected to be low, to limit their impact on other users; configure chooses this time automatically (and tells you what time it has chosen). If that time is inconvenient, change the variables UPHOUR and UPMIN in mirror.conf. To avoid delays in your mirror updates, do not reschedule them to begin between 0330 and 0530 UTC.
- [Optional] If you wish to mirror any of the optional volumes, create a directory for each of them (suggested names: p2, p3, ...), and be sure there is sufficient free disk space for the files to be mirrored. Warning: Make sure that these directories do not overlap (don't, for example, create them within /home/physionet, or within each other). These directories should be owned by "pn"; they should be world-readable and searchable, and "pn" needs to be able to write in them (i.e., permissions should be 755).
  When the directories are ready, set the variables P2, P3, ... to the names of these directories by editing mirror.conf, and then run configure again.
- As root, run
    ./install
  in order to schedule daily mirror updates, and purging of the temporary/cache directory. Once you have done this, these processes will begin automatically within 24 hours.
- [Optional] If you wish to begin downloading files for the mirror immediately, you can do so (once again as "pn", not as root) by running
    mirror-update -v
  If you do so, do not allow this process to continue while your scheduled daily update is running. It is safe to interrupt this process at any time; if you run it again, it will continue from where it was interrupted.
  The -v option (which may be omitted) causes the updater to report each file transfer as it happens.
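Two of the steps above place constraints on directory layout: pn's home directory must not be /home/physionet or lie beneath it, and the directories for optional volumes must not overlap each other or /home/physionet. A minimal sketch of how these could be checked follows. The function names are ours, not part of the mirror kit, and paths are compared as plain strings, so this assumes absolute paths and does not follow symbolic links.

```shell
#!/bin/sh
# Succeed if path $1 is the same as, or lies inside, directory $2.
# Paths are compared as strings: pass absolute paths, symlinks unresolved.
is_inside() {
    case "${1%/}/" in
        "${2%/}/"*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Succeed if either directory contains (or equals) the other.
dirs_overlap() {
    is_inside "$1" "$2" || is_inside "$2" "$1"
}

# Print every overlapping pair among the directories given as arguments.
report_overlaps() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        # Compare the first directory with each remaining one.
        for other in "$@"; do
            if dirs_overlap "$first" "$other"; then
                echo "overlap: $first $other"
            fi
        done
    done
}
```

For instance, `is_inside /home/pn /home/physionet` fails, which is the desired result for pn's home directory, and `report_overlaps /home/physionet /data/p2 /data/p3` prints nothing when the layout is acceptable. The /data paths are hypothetical examples.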
Depending on your choice of optional volumes in the step above, and on the speed of your Internet connection, it may take several hours (or even days) to retrieve all of the files you have chosen to mirror the first time. Subsequent updates (see below) will be much faster; if your mirror is reasonably up-to-date, the mirror script will typically finish in 1 to 10 minutes.
Rarely, a daily update may require an unusually large download, as when a large amount of data has been added to the PhysioNet server, or if you add previously unmirrored volumes to your mirror. By default, your daily update will stop after 2 hours, and the next daily update will continue where the previous one ended.
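If you change the start time of the daily update (UPHOUR and UPMIN in the optional scheduling step above), it can help to confirm that the new time avoids the 0330 to 0530 UTC window. The sketch below is an illustration under stated assumptions: the function name is ours, the time passed in must already be expressed in UTC (we have not verified which time zone mirror.conf uses), and both ends of the window are treated as busy.

```shell
#!/bin/sh
# Succeed if the given hour ($1) and minute ($2), in UTC, fall within the
# 03:30 to 05:30 window during which updates should not be started.
in_busy_window() {
    # Strip one leading zero so values such as 08 are not read as octal.
    h=${1#0}
    m=${2#0}
    minutes=$(( ${h:-0} * 60 + ${m:-0} ))
    # 03:30 is minute 210 of the day; 05:30 is minute 330.
    [ "$minutes" -ge 210 ] && [ "$minutes" -le 330 ]
}
```

A call such as `in_busy_window 04 15 && echo "choose another time"` flags a start time inside the window.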
When the update is completed, mirror-update sends a report to physionet.org, which it will digest and incorporate into the PhysioNet Mirrors page. The report lists the URL and geographic location of your mirror, and the time at which it was most recently updated, to help PhysioNet visitors choose a suitable mirror.
In order to give you a chance to test and make any necessary adjustments to your new mirror site, it will not appear on the Mirrors page until it has been running for a few days and has been updated at least once.
Running the PhysioBank ATM on your mirror
If your mirror meets the recommended requirements above (1 GHz or faster CPU, 512 MB RAM, and 10 GB spare disk space), you can run the PhysioBank ATM locally. (Otherwise, visitors are redirected to the PhysioNet master server if they follow links to the ATM.) If you wish to run the ATM, first allow your mirror to complete at least two daily updates. (This will ensure that your mirror's working copies of the WFDB software needed by the ATM are up-to-date.) Then, in this directory, run this command as root:
./enable-ATM
This command installs plt from PhysioToolkit if necessary, and then creates the ATM's cache directory. The ATM will begin working locally as soon as the cache directory has been created.
Test the ATM to verify that it is working properly. If it is not, please disable it until the problem can be corrected, so that your mirror's users can have ready access to the ATM services on the master server. To turn off local ATM services for any reason, return to this directory and run this command as root:
./disable-ATM
This command removes the ATM's cache directory, disabling local ATM services.
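Before enabling the ATM, you may want to confirm that the 10 GB of spare disk space mentioned above is actually available. A minimal sketch, with function names of our own choosing, using the POSIX output format of df:

```shell
#!/bin/sh
# Print the free space, in whole GB, on the file system that holds path $1.
free_gb() {
    # df -P prints one header line and one data line; with -k, field 4 is
    # the available space in 1024-byte blocks (1048576 of them per GB).
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Succeed if the file system holding path $1 has at least $2 GB free.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}
```

For example, `has_free_gb /home/ptmp 10 || echo "not enough space for the ATM cache"` checks the partition suggested earlier for the cache; substitute whichever existing path your cache will live on.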
Search engines and robots.txt
Once your site is listed on the Mirrors page (or linked to from any other public web page), the web spiders of the major search engines, such as Google, will begin indexing it. Typically these spiders will consume a significant amount of bandwidth when they first visit your site, but this will decrease to a much lower amount once your site is fully indexed. You can avoid almost all of this traffic if you wish (for example, if your mirror shares a network connection, or if your total monthly throughput is limited or metered by your ISP). Unless the network bandwidth consumed by the spiders is a problem, don't do this (it is useful if users can find pages on your mirror using a search engine, after all).
You can exclude most web spider traffic by modifying /home/physionet/html/robots.txt. Before doing this, modify your /usr/bin/mirror-update script by changing the line that reads:
rsync $RSOPTS physionet.org::physionet /home/physionet
to
rsync $RSOPTS --exclude robots.txt physionet.org::physionet /home/physionet
(This change will prevent the daily updates from replacing your customized robots.txt with the original one.)
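The edit described above can also be made with sed rather than by hand. This is a sketch only: the function name is ours, and it assumes the rsync line in your copy of mirror-update reads exactly as shown above. It writes the modified script to standard output so that you can review it before replacing anything.

```shell
#!/bin/sh
# Print the script in file $1 with "--exclude robots.txt" added to the
# rsync line that mirrors the base volume. Lines that already contain the
# option do not match the pattern and are left unchanged.
add_robots_exclude() {
    sed 's|rsync \$RSOPTS physionet\.org::physionet|rsync $RSOPTS --exclude robots.txt physionet.org::physionet|' "$1"
}
```

A cautious way to use it is `add_robots_exclude /usr/bin/mirror-update > /tmp/mirror-update.new`, followed by a diff against the original before copying the new version into place as root.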
Next, edit (or replace) /home/physionet/html/robots.txt so that its contents are:
User-agent: *
Disallow: /
Make sure that the edited version has the same ownership and file access permissions as the original (owned by "pn", readable by anyone).
The robots.txt protocol is advisory, not mandatory, so making this change may not eliminate all traffic from web spiders, but it should greatly reduce that traffic at the very least.
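The replacement robots.txt can be written from the command line as follows. This is a minimal sketch with a function name of our own; run it as "pn" so that the file keeps the expected ownership, and note that mode 644 is our reading of "readable by anyone".

```shell
#!/bin/sh
# Write a robots.txt at path $1 that asks all robots to stay away from the
# whole site, and make it readable by anyone.
write_blocking_robots() {
    printf 'User-agent: *\nDisallow: /\n' > "$1"
    chmod 644 "$1"
}
```

For the mirror described here, the call would be `write_blocking_robots /home/physionet/html/robots.txt`.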
Maintaining your mirror site
Very little, if any, maintenance is required once your mirror site has been established as described above. Your web server will begin generating access logs, which will be rotated periodically and eventually discarded. If your logs are stored on a file system with little free space, you may need to clear them manually when it fills up. (The names and locations of the logs are usually specified in httpd.conf. Under Fedora Linux, rotation and disposal of old log files are handled by the logrotate utility, run periodically by cron; no special setup is required unless you have renamed the log files in httpd.conf, or if you wish to keep old log files for more than four weeks.)
PhysioNet itself is growing, and you should occasionally check to see that your mirror site has room to grow with PhysioNet. Given that the cost of a gigabyte of disk storage is continually decreasing as density and speed are continually increasing, there is little reason to purchase more storage than you will need in the next six months to one year.
If you wish to begin mirroring an optional PhysioNet volume, and you have not previously mirrored any of the optional volumes, it is best to fetch a fresh copy of the PhysioNet mirror kit (see above). Edit mirror.conf (which will not have been altered by fetching a fresh kit), and run configure and install again as above. If you do this, please avoid changing your mirror's hostname, location, and maintainer as recorded in mirror.conf, so that your e-mail notifications will continue to be properly recognized by the master PhysioNet server.
If you wish to move a mirror to another host, or simply to discontinue mirroring for whatever reason, use crontab -e to edit your crontab and remove the mirror-update command line. Your mirror will be removed from the Mirrors page after a few days of inactivity.
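As an alternative to editing the crontab interactively, the mirror-update entry can be filtered out. The sketch below uses a function name of our own and assumes that every line mentioning mirror-update should go; check the output before installing it.

```shell
#!/bin/sh
# Read a crontab on standard input and write it to standard output
# without any line that mentions mirror-update.
drop_mirror_update() {
    sed '/mirror-update/d'
}
```

Running `crontab -l | drop_mirror_update` shows the result without changing anything. Once it looks right, piping the same output into `crontab -` installs it for the current user. Which account holds the entry depends on how install set it up, so run this as that account.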
If you would like help understanding, using, or downloading content, please see our Frequently Asked Questions. If you have any comments, feedback, or particular questions regarding this page, please send them to the webmaster. Comments and issues can also be raised on PhysioNet's GitHub page.
Updated Friday, 07-Oct-2016 22:17:24 CEST
PhysioNet is supported by the National Institute of General Medical Sciences (NIGMS) and the National Institute of Biomedical Imaging and Bioengineering (NIBIB) under NIH grant number 2R01GM104987-09.