RE: Naming convention of 3700 files

From: Todd Smith (todd@geospiza.com)
Date: Sat Feb 17 2001 - 18:01:21 EST


Hi All,

I've been following the file naming thread and thought I'd add our
experiences (and solution) in dealing with this stuff.

First, it's a mess. Second, as James points out, different users want to
identify samples in different ways for different purposes, so it's a
complicated mess. Third, as Paul and Jeremy point out, many labs with 3700s
and 3100s have learned that ABI (AB, Applera, PE - I'll stick with ABI)
thought it knew how folks wanted to name things - because that's how some
genome centers do it - so the mess is out of control.

What happens in the ABI software (3700 and 3100) is that a file name is
built from the sample name (in the plate record - aka sample sheet), the
plate name, the well coordinates in the plate, and the number of the
capillary that the sample went through. Too bad this has no relation to my
rack of Eppendorf tubes in the freezer.

Unfortunately, in the chromatogram file, only the sample name is tracked.
The sample name from the plate record or sample sheet is added to a
field in the chromat. Thus, each file can be identified in two ways - the
sample name and the file name. The sample name is preserved because it is
stored in the file. The file name, on the other hand, can be changed (and is
configurable, see bottom of the message).

Folks who rely on the file name for sample tracking live a precarious and
adventurous life because file names can be changed and tracking information
is lost. For example, an issue we run into a lot is that the plate (or run)
name is not stored anywhere in the file - so it is lost as soon as chromats
leave the NT workstation.

The challenge for a core lab is that they need to accommodate different
naming conventions for different folks and purposes. Core labs are also
getting new instrumentation whose software has a different view of data
management than previous instrumentation, creating the following situation.

The process: a scientist submits a sample to a core lab and fills out a form
to give that sample a name that they understand. The core lab may use that
name and propagate it through their system or they may wish to use anonymous
IDs. Anonymous IDs are advantageous in labs where all samples need to be
uniquely identified (can't have a gel or cap. run with identical sample
names), or sample information must be preserved for long periods such as in
clinical and corporate labs, or labs engaged in long term projects.

After the data are collected each file is then given a file name - used to
be the sample name, now something more complicated. At the end of the day,
we have three identifiers for a raw data file: the researcher given name,
the lab ID, and the file name. If a lab manages data with Excel
spreadsheets or on a file system, maintaining this information quickly gets
out of hand. Perl scripts and other file renaming utilities are used, but
mistakes can be propagated rapidly and in high volume. Further, programs like phrap
impose file naming schemes based on concepts like clone, sequencing
chemistry, and primer orientation, so additional information may be needed
to build a correct filename. In those cases one needs to open many files,
read ABI codes, and build the file names.
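
To illustrate how a renaming script can guard against propagating mistakes,
here is a minimal sketch in Python. It is not taken from any tool mentioned
in this thread; the function names, the mapping format, and the log file
name (rename_log.tsv) are all hypothetical. The idea is only to check the
whole batch before touching any file, and to write an old-name/new-name log
so the original names can be recovered.

```python
import os
from collections import Counter


def plan_renames(mapping):
    """Check a {current_name: new_name} mapping (hypothetical input format).

    Raises ValueError if two files would end up with the same new name.
    """
    counts = Counter(mapping.values())
    dupes = sorted(name for name, n in counts.items() if n > 1)
    if dupes:
        raise ValueError("new names are not unique: " + ", ".join(dupes))
    return mapping


def apply_renames(directory, mapping, log_name="rename_log.tsv"):
    """Rename files in `directory` only after the whole batch checks out."""
    plan_renames(mapping)
    # Validate every file first so a bad batch changes nothing.
    # This is deliberately conservative: a new name that already exists
    # in the directory is rejected, even if that file is itself about
    # to be renamed.
    for old, new in mapping.items():
        if not os.path.exists(os.path.join(directory, old)):
            raise FileNotFoundError(old)
        if os.path.exists(os.path.join(directory, new)):
            raise FileExistsError(new)
    # Record old -> new before renaming so the original names survive.
    with open(os.path.join(directory, log_name), "w") as log:
        for old, new in mapping.items():
            log.write(f"{old}\t{new}\n")
    for old, new in mapping.items():
        os.rename(os.path.join(directory, old), os.path.join(directory, new))
```

Note that this changes only the file name; the sample name embedded in the
chromatogram is untouched.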

The better solution is to store information in a database - commonly an
RDBMS. The Finch-Server, for example, tracks all references to samples that
are sequenced. We call the scientist given name the "Label", the name from
the sample sheet or plate record the "Sample name", and store the ABI
filename. A unique sample or chromatogram ID is associated with each raw
data file so the integrity of this information is preserved. Since data are
maintained in an RDBMS, one can export data to local filesystems with any
identifier. One can also create new file names based on data attributes
stored in the ABI files.
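
As a rough illustration of the idea (not the Finch-Server schema, which is
not described here), the three identifiers plus a unique ID can be held in
a single table. The sketch below uses Python's built-in sqlite3 module; the
table name, column names, and function names are assumptions made up for
this example.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) tracking table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chromatogram (
               chromat_id   INTEGER PRIMARY KEY,  -- unique ID per raw data file
               label        TEXT NOT NULL,        -- name given by the scientist
               sample_name  TEXT NOT NULL,        -- from sample sheet / plate record
               abi_filename TEXT NOT NULL UNIQUE  -- name written by the instrument
           )"""
    )
    return conn


def register(conn, label, sample_name, abi_filename):
    """Store one raw data file and return its unique chromatogram ID."""
    cur = conn.execute(
        "INSERT INTO chromatogram (label, sample_name, abi_filename) "
        "VALUES (?, ?, ?)",
        (label, sample_name, abi_filename),
    )
    conn.commit()
    return cur.lastrowid


def export_names(conn, identifier):
    """Map each stored ABI file name to an export name built from `identifier`.

    The unique ID is prefixed so that identical labels or sample names
    still give distinct export names.
    """
    if identifier not in {"label", "sample_name", "abi_filename"}:
        raise ValueError(identifier)
    rows = conn.execute(
        f"SELECT chromat_id, {identifier}, abi_filename "
        "FROM chromatogram ORDER BY chromat_id"
    )
    return {abi: f"{chromat_id}_{value}" for chromat_id, value, abi in rows}
```

Because the original ABI file name stays in the table, an export under a
new name never destroys the link back to the instrument's own name.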

Finally, the file naming convention imposed by the 3700 software can be
edited. There is a text file in DriveD:\Perkin-Elmer\ABI\DataExtractor
called samplename.txt. You need to change the line:

FORMAT=PLATEID_WELLID_SAMPLEID

to

FORMAT=SAMPLEID

This will remove the plate and well ID from the names. The capillary number
will still be added because, in the plate record (using a fill down), it is
easy to give all samples the same name. The capillary ID is then needed to
keep the file names distinct when they are written to the file system.
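
To make the effect of the FORMAT line concrete, here is a small Python
sketch. It assumes the tokens are joined with underscores and that the
capillary number is appended as a two-digit suffix; the exact layout the
3700 software produces may differ, and the function name is made up for
this example.

```python
def build_file_name(fmt, plate_id, well_id, sample_id, capillary):
    """Expand a FORMAT string such as PLATEID_WELLID_SAMPLEID.

    Assumption: the capillary number is always appended as a suffix,
    whatever the FORMAT line says.
    """
    values = {"PLATEID": plate_id, "WELLID": well_id, "SAMPLEID": sample_id}
    parts = [values[token] for token in fmt.split("_")]
    parts.append(f"{capillary:02d}")
    return "_".join(parts)


# Two wells given the same sample name by a fill down:
wells = [("A01", 1), ("B01", 2)]
for well, cap in wells:
    print(build_file_name("SAMPLEID", "plate7", well, "cloneX", cap))
```

With FORMAT=SAMPLEID, both files would be called cloneX if the capillary
number were dropped; the suffix is what keeps them apart.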

Thanks for reading.

Todd

Todd M. Smith, Ph.D.
President
Geospiza, Inc.
www.geospiza.com

-----Original Message-----
From: Association of Biomolecular Resource Facilities
[mailto:abrf-request@aecom.yu.edu]On Behalf Of James VanEe
Sent: Friday, February 16, 2001 6:07 AM
To: Recipients of ABRF List
Subject: Re: Naming convention of 3700 files

At 12:25 PM -0500 2/15/01, Jeremy Medalle wrote:
>>...The next step
>>is to copy the commands into Apple script and run the script.
>>
>>Please note that the embedded name will not change.

I don't have anything constructive to add, but that bugs me and
always has. If you're doing any analysis that looks in the file and
depends on the embedded name for downstream processing (I can think
of lots of examples), it'll mess things up royal. Like everything
else, the fact that multiuser cores are, well, multiuser, compounds
the problem because we don't know which of our customers might care...

At 1:04 AM -0500 2/16/01, Paul Morrison wrote:
>I would invite anyone else who has scripts/work-arounds for naming renaming
>to meet at Jeremy's poster also. One would think that besides Apple
>Script, (which would be my favorite solution because I know one when I
>see one), there might be some decent scripting on the NT side.

I'll be there :-) I'm especially interested in the windows scripting
hosts because like Paul, applescript is home.

-James
-------------------------------------------------------------------
James VanEe Phone: (607) 254-4862
BioResource Center
Computing Facility
170/171 Biotech Bldg Fax: (607) 254-4847
Cornell University
Ithaca, NY 14853 www: http://brcweb.bio.cornell.edu
-------------------------------------------------------------------


