A Utility Program for Renaming and Sorting Data Files from ABI Automated DNA Sequencers

Clark T. Riley
HHMI/Johns Hopkins University School of Medicine


When a biomolecular resource facility acquires an automated DNA sequencer, one of the first new realities it will face is the sudden flood of data generated by the new instrument. These data are the currency of the DNA sequencing trade and the only items of interest to most end users. End users can reasonably expect the data files from the DNA sequencer to be:

The labor needed to distribute these data can be staggering, and high-volume DNA sequencing laboratories must devise ways to handle data efficiently and appropriately. Labeling files and looking up information for end users by hand is neither efficient nor appropriate use of laboratory staff.

Fortunately, the sample sheet files used by the Applied Biosystems Division automated DNA sequencers contain all the information needed to sort and distribute DNA sequencer data files. I have produced a simple utility that automates the file sorting process and that also performs accounting functions to assist in billing. I must point out in advance that this program is a simple utility. It is not commercial-grade or bulletproof. There are constraints to its use that must be observed. However, it is very useful and easy to use. The software is being made available to ABRF members with the understanding that one must read, understand, and follow the instructions that accompany the software.

My program, FileSorter, is written for the Macintosh operating system. By dragging the ABI sample sheet file onto the FileSorter program icon, the data and text files from an ABI DNA sequencer are sorted into end user folders contained within principal investigator folders (Figure 1).

Figure 1: Results of using the FileSorter utility, as viewed from the Macintosh desktop. In this example, the hard disk "DNA1" contains folders with names corresponding to the initials of principal investigators, e.g., FG, GG, ... PL. The principal investigator folder "MK" has been opened to show the separate folders for each individual in MK's group, and one of these individual folders, "AS", has been opened to show the sorted data files. Note that all these filenames begin with the unique identifier "MKAS". As configured, individuals in the MK group can open the MK folder but not other principal investigator folders.

These folders are located on a server computer, from which end users can copy data to computers in their laboratories. FileSorter does not manipulate or create DNA sequencer data files; it only moves and renames them, following instructions entered in the "Comments" field of the sample sheet. To operate properly, FileSorter requires the following:

File sorting is accomplished by copying the folder containing the sample sheet and the DNA sequencer data files to the server and then dragging the sample sheet (on the server) onto the FileSorter program icon (on the remote computer). With normal morning traffic on Ethernet, the whole process of sorting and renaming 36 files and performing accounting functions takes about three minutes. For example, if the operator of the DNA sequencer enters "ABCD C. elegans fact 27, 567 bp" into "Comments" at line 23 of the sample sheet, then the data file for lane 23 will be renamed "ABCD C. elegans fact 27, 567 bp" and placed in the CD folder (directory) inside the AB folder (directory) of the server (Figure 1). The text file containing the sequence alone will be labeled "ABCD C. elegans fact 27, 567 bp.seq". FileSorter makes new folders for principal investigators and end users "on the fly", as it needs them; if a folder does not already exist on the server, FileSorter creates and names it automatically. Any end users who have access to the CD folder can retrieve these data 24 hours a day, 7 days a week from any computer, as long as they can log onto the server.
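The naming rule in the example above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not FileSorter's source code: the function name destination_for, the returned dictionary, the default server root "DNA1" (borrowed from Figure 1), and the use of "/" as a path separator are all assumptions made for the sketch.

```python
from pathlib import PurePosixPath


def destination_for(comment: str, server_root: str = "DNA1") -> dict:
    """Derive target folders and file names from a sample sheet comment.

    Hypothetical helper: the first two characters of the comment are taken
    as the principal investigator code and the next two as the end user code.
    """
    code = comment[:4]
    if len(code) < 4 or not code.isalnum():
        raise ValueError(
            f"comment does not start with a 4-character identity code: {comment!r}"
        )
    pi_code, user_code = code[:2], code[2:]
    # End user folder nested inside the principal investigator folder.
    folder = PurePosixPath(server_root) / pi_code / user_code
    return {
        "pi": pi_code,
        "user": user_code,
        "data_file": folder / comment,             # renamed data file
        "text_file": folder / (comment + ".seq"),  # sequence-only text file
    }
```

Called with the comment "ABCD C. elegans fact 27, 567 bp", this sketch places both files in the CD folder inside the AB folder and gives the text file the ".seq" suffix. A comment that does not begin with a four-character alphanumeric code is rejected rather than sorted into a guessed location.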

The accounting functions of FileSorter automatically generate, in the AB folder, a file named "AB$" in Microsoft Excel TEXT format that lists all sequence files generated for the principal investigator AB. The CD folder will contain a listing, "CD$", of all sequence files generated for the end user CD alone. A master file, "CBDNASeq$", will appear on the server, listing all the files processed. As new data files are copied to the server and sorted, FileSorter adds the names of newly sorted files to the appropriate lists.
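The bookkeeping described above might look roughly like the following Python sketch. It is not FileSorter's actual implementation: the function name record_file is hypothetical, and writing one file name per line is an assumption, since the exact layout of the listing files is not given here.

```python
from pathlib import Path


def record_file(server_root: Path, new_name: str,
                master_name: str = "CBDNASeq$") -> list:
    """Append a newly sorted file name to the three listing files.

    Hypothetical helper; assumes one file name per line in each listing.
    """
    pi_code, user_code = new_name[:2], new_name[2:4]
    user_dir = server_root / pi_code / user_code
    # Create the investigator and end user folders if they are missing.
    user_dir.mkdir(parents=True, exist_ok=True)
    listings = [
        server_root / master_name,                # every file processed
        server_root / pi_code / (pi_code + "$"),  # all files for the investigator
        user_dir / (user_code + "$"),             # files for one end user
    ]
    for listing in listings:
        # Append mode keeps earlier entries and creates the file on first use.
        with listing.open("a", encoding="utf-8") as handle:
            handle.write(new_name + "\n")
    return listings
```

Because each listing is opened in append mode, repeated sorting runs extend the existing lists instead of overwriting them, which matches the behavior described for newly sorted files.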

In my laboratory we use a Macintosh IIci as a server computer. The server's hard disk is slightly smaller than a recordable CD-ROM (about 630 megabytes), our preferred medium for archiving DNA sequence data. With two DNA sequencers in heavy use, we generate enough data to require archiving to CD and purging the server about once every sixty days. We have used Macintosh models IIci, 660AV, 650, and PowerMacintosh 8100 as remote computers to run FileSorter with identical results. We provide DNA sequencing services to about 20 principal investigators and generate about 15,000 data files each year. Every end user is given an individual identity code (ABCD in the example above) and a password on the server, and end users can change their passwords themselves from computers in their laboratories. Each principal investigator is assigned a password that enables them to access all data generated for their group. The initials used to designate principal investigators and end users are assigned by members of my laboratory to avoid conflicts (two individuals with the same identity code on the server, AB in the example above), and usually these are the initials of the individual's name. Identity code conflicts are rare, but when they occur, we resolve them by assigning any two non-conflicting alphanumeric characters.

A self-extracting archive of the software and detailed instructions (file size 0.5 MB) can be downloaded by clicking here or can be obtained by electronic mail from clark_riley@qmail.bs.jhu.edu if your mail system can handle the BinHex (.hqx) format. There are undoubtedly many improvements that could be made to FileSorter, but other current demands promise to make that a slow and uncertain prospect. I am more than willing to share the source code with any programmers who want to extend it.

The author may be contacted at HHMI/Johns Hopkins University, 725 N. Wolfe St., 807 PCTB, Baltimore, MD 21205-2105.




Created: 1st June 1996
Last modified: 12th June 1996