Here are links to in-progress, open-source software projects that I am involved in. Some support my research; others are general utilities.
Taxtastic is a collaboration with Erick Matsen and others in the FHCRC Computational Biology group. This project provides software to perform sequence-based taxonomic assignment of microorganisms using a phylogenetic approach. It is being developed in parallel with Erick's fantastic pplacer, which adds aligned sequences to an existing maximum-likelihood (ML) phylogenetic tree. The taxit command-line utility assembles a sequence alignment, phylogenetic tree, and sequence annotation into a reference package (refpkg) that can be used with pplacer.
Code and documentation are hosted on GitHub: https://github.com/fhcrc/taxtastic
Reproducible research with bioscons
The term reproducible research (in my mind, anyway) can be summarized as "knowing what you did, and being able to do it again efficiently." Some folks here go into more detail on the concept. The goal of this project is to provide a framework for creating analytic pipelines that are self-documenting, repeatable, and sensitive to changes in source code and data. bioscons is a thin wrapper around the build tool SCons. From the SCons website:
SCons is an Open Source software construction tool—that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.
SCons provides a Python-based scripting environment that is easy to extend to bioinformatics applications by writing wrappers for commonly used tools. Why might you want to do this? Well, say you are building a phylogenetic tree of bacterial 16S rRNA sequences. The steps involved might be something like this:
- identify sequences in a database
- download the sequences
- reformat the sequence files
- perform a multiple alignment
- build the tree
Each of these steps may be scripted, but it is extremely unlikely that such a script will be run just once. Code to perform each step may need to be developed, parameters selected, and source data updated. With bioscons, the objective is to create a build script that is sensitive to changes in its dependencies, so that the entire pipeline can be reproduced by running only the steps whose inputs have changed. This has benefits during development and later, when preparing for publication, responding to reviewers, and simply being able to faithfully reproduce what was done. This is not the first project to use SCons for reproducible computational research (the Madagascar project, for example, is a well-developed platform for research in image analysis), but the emphasis here is on bioinformatics.
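To make "running only the necessary steps" concrete, here is a minimal sketch in plain Python. It is not the SCons or bioscons API: `Pipeline`, `step`, `run`, and the `.signatures.json` file are hypothetical names invented for illustration, and `reformat` and `align` are toy placeholders rather than real bioinformatics tools. The sketch assumes a step should re-run only when the content hash of one of its sources has changed, which is similar in spirit to the content signatures SCons uses by default.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def signature(path):
    """Return a hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


class Pipeline:
    """Hypothetical stand-in for a build tool (not the SCons API)."""

    def __init__(self, workdir):
        self.db_path = Path(workdir) / ".signatures.json"
        # Signatures recorded by earlier runs, keyed by step name.
        self.db = json.loads(self.db_path.read_text()) if self.db_path.exists() else {}
        self.ran = []  # names of the steps actually executed this time

    def step(self, name, action, sources, target):
        sources = [Path(s) for s in sources]
        target = Path(target)
        current = {str(s): signature(s) for s in sources}
        if target.exists() and self.db.get(name) == current:
            return target  # inputs unchanged: skip the step
        action(sources, target)
        self.db[name] = current
        self.db_path.write_text(json.dumps(self.db, indent=2))
        self.ran.append(name)
        return target


def reformat(sources, target):
    # Toy "reformat" step: upper-case sequence lines, keep FASTA headers.
    lines = sources[0].read_text().splitlines()
    out = [ln if ln.startswith(">") else ln.upper() for ln in lines]
    target.write_text("\n".join(out) + "\n")


def align(sources, target):
    # Toy placeholder for a multiple alignment: pad sequences with gaps.
    lines = sources[0].read_text().splitlines()
    width = max((len(ln) for ln in lines if not ln.startswith(">")), default=0)
    out = [ln if ln.startswith(">") else ln.ljust(width, "-") for ln in lines]
    target.write_text("\n".join(out) + "\n")


def run(workdir):
    """Declare the pipeline and return the names of the steps that ran."""
    workdir = Path(workdir)
    pipeline = Pipeline(workdir)
    clean = pipeline.step("reformat", reformat,
                          [workdir / "raw.fasta"], workdir / "clean.fasta")
    pipeline.step("align", align, [clean], workdir / "aligned.fasta")
    return pipeline.ran


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        raw = Path(d) / "raw.fasta"
        raw.write_text(">a\nacgt\n>b\nac\n")
        print(run(d))  # first run: both steps execute
        print(run(d))  # nothing changed: no steps execute
        raw.write_text(">a\nACGT\n>b\nac\n")
        print(run(d))  # only "reformat" runs; its output is byte-identical
```

Because the decision is based on file contents rather than timestamps, the last call re-runs the reformatting step but skips the alignment: the reformatted file comes out identical, so nothing downstream needs to be rebuilt. A real build tool adds much more (dependency graphs, tracking of the commands themselves, parallel execution), so treat this only as an illustration of the principle.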
Code and documentation are hosted on GitHub: https://github.com/nhoffman/bioscons
Classification by local similarity threshold (clst)
I have written two R packages, clst and clstutils, that implement a type of distance-based classification I have been developing to perform taxonomic assignment of 16S rRNA gene sequence reads. Both are available from the Bioconductor website.
The Seq package contains many utilities for biological sequence manipulation that I have accumulated over the years. The first thing to point out is that it duplicates much of the functionality provided by the Biopython project. If you need to parse various sequence file formats (FASTA, GenBank, etc.), you should probably go there instead. However, there are a few goodies in here that I haven't found elsewhere. For now it remains a dependency of several other projects until I transition to Biopython.
Code and documentation are hosted on GitHub: https://github.com/nhoffman/Seq
My Emacs config (.emacs.d) is on GitHub if you are interested in such things.