Lab 2

Ïn this lab, we are going to continue with the Python library from yesterday.

During today's lab, we ask that you create a public github repository for your code if you didn't already, and create branches for the different subtasks you do.

CMake

Create a CMakeLists.txt file based on the slides that is able to generate the library you are creating. The CMakeLists file should properly pick up the Boost and Python version as needed to build the library file. You might want to browse the net for documentation on the FindBoost and FindPython CMake modules (as well as the notes in the slides).

When you are done, you are supposed to be able to create a subdirectory build, run ccmake .., generate your makefile, run make, and then load your library in ipython again. If you are up for it, you could look into generating config header files from CMake, and make the actual module name (both the file name and the Python module name) configurable.

Calling PyVCF directly from C++

Boost Python has an object class that can accept any Python object. Rather than adding individual position, we can accept a VCF reader. This reader is an iterable, we know that since we can do the for loop in Python. This means that it implements the specific __iter__ function in Python, which returns an iterator implementing relevant next and get functions. In fact, you can call an arbitrary function on a Python object by writing code like obj("functionname")(arg1,arg2). This kind of makes sense, since Python name binding is very flexible.

Thankfully, there is a class stl_input_iterator<T>. This class can be constructed based on the object and will then start an iteration. For having a matching "end" iterator, you create an empty iterator of the same type.

Each element in the resulting list will be a VCFRecord, which is also represented by a general Python object. You need to access the "POS" attribute explicitly, i.e. record.attr("POS"). Remember there is also an extract template to get a desired C++ type from an arbitrary Python value. You will need to access the "POS" attribute and extract the proper int value.

Add a new method useReader based on this approach. It should iterate over the full reader until the end, internally doing what addPosition did. Test it on a large enough subset of our test file that you can get an accurate timing result. Is it faster than the more pure Python version?

Extra task - using libStatGen

This performance didn't impress anybody. The actual island length computation algorithm is fast, but the reading is painfully slow. Let's go to using more C++ here. libStatGen contains relatively nice C++ APIs for several genomic formats. Look at the classes VcfFileReader, with its methods open, readRecord, and the class VcfRecord. Try to integrate what you need into your CMake file as nicely as possible. Time the end result for a vcf.gz file. Add a new interface where the Python code just passes the file name and your library uses libStatGen behind the scenes. Is this version faster at loading?

From a parallelism perspective, one can note that none of these libraries actually decompress on a separate thread from parsing the text. At least in the Python case, the overhead of parsing might be large enough that it would make a bit of sense to separate these tasks. You can try to see what performance you get with a compressed and uncompressed vcf file stored in /dev/shm, which is basically a RAM disk. (Reading the uncompressed file from the physical disk will be slower for sure.)