Lab 1

Ïn this lab, we are going to create a simple Python library in C++. The basic premise is that algorithms requiring tight loops can easily benefit from being run in a compiled language.

The background is that in genome analysis, one frequently focuses on a set of known variant positions. That is, rather than handling the full genome sequence of each individual, we note that the sequence is identical in most positions, and focus on a specific subset where we have high quality data regarding sequence variations.

Results in this form are frequently stored in so-called variant call format (VCF) files. This is a text-based format, with one line per variant. Today, we will use a Python library to parse vcf files and feed those to our C++ code, on the UPPMAX cluster tintin.

Preparations, loading software

Login with your UPPMAX user account on tintin. This will require that you have an ssh client on your computer. If you don't, investigate how you install one on your machine. For Windows, you can use e.g. PuTTY. You might also want to consider cygwin, which allows you to run many *nix-like tools and libraries in a compatibility layer that makes Windows behave more like MacOS X or Linux. This is in contrast to true ported versions of e.g. Python. It can sometimes get confusing if you have both cygwin-based versions of libraries and truly native Windows versions. Sometimes, a specific library is not available in a true Windows version, while it is possible to build from source in cygwin.

Anyway, if you are using a command line, connect to the tintin cluster with the following command:

ssh username@tintin.uppmax.uu.se

You are now connected to the login node. This is a shared resource that should not be used for long-winding computations. You can do ls, short installations and compilations etc. If you do anything more heavy, you should submit a job or get an interactive shell on a node in the cluster. Getting an interactive shell for one hour can be done like this:

interactive -t 1:00:00 -A g2015053

The two flags specify the time limit and the name of the project allocation we have for this course. You might have to wait for a couple of seconds. Then, you have a normal shell again, but now it is running as a reserved allocation on a specific node.

The tintin cluster is configured with some software right away. If you want to have additional software available in your environment, you can load it. This basically means that the path and library settings are changed to access the new software. In our case, we want to load specific versions of g++, boost, and Python.

module load gcc/4.9.2
module load build-tools
module load boost/1.59.0_gcc4.9.2
module load python/2.7.6

You will get a warning for loading boost. While it's inconvenient, it's OK and expected.

We will be using the vcf package and a current version of ipython.

Preparations, installing software

We will also install a specific Python package, PyVCF. This can be done using the pip tool.

On a machine you control yourself, you can simply run pip install pyvcf. It will then be downloaded and installed in the common directories and work right away. In the shared environment of UPPMAX, you can't do this. However, the solution is rather simple. You can put the libraries in your home directory and refer to those versions instead.

pip install --user pyvcf
pip install --user ipython

This will hopefully succeed. Now you can try it in ipython.

ipython
...
import vcf

This might not succeed. If so, get back to the command line and enter the following.

export PATH=$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH

These two lines will make sure that the directories where pip --user puts the results are respected. These settings can be included in the .bash_profile file (there is also a file called .bashrc, and we won't go into details for which one to choose here). You can of course also add the module load lines above there. Now, we have put the basic pieces together.

Testing a simple library with Boost Python

We have taken a "hello world" example for Boost Python and adapted the makefile for our setting. You can clone the path https://github.com/cnettel/Shocksolution_Examples.git (by using https, authentication is not needed). You might also want to fork this repository to later create your own package within this.

git clone https://github.com/cnettel/Shocksolution_Examples.git
cd Shocksolution_Examples/Boost.python/minimal

Look at the different files. As you can see, some paths are hardcoded in the Makefile. The tool python-config can report those, but tomorrow we will be looking at using CMake for finding these paths.

Now, try to make this library.

make

Hopefully, this is successful (no error messages). A file hello_ext.so is created. Test it.

ipython

import hello_ext
hello_ext.greet()

Phew, that was quite a mouthful!

Testing PyVCF

We are now going to load the VCF file (from ipython).

import vcf
vcf_reader = vcf.Reader(open('/proj/g2015053/nobackup/chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz', 'r'))
for record in vcf_reader:
    print record.POS

If you run this to its natural completion, it will take a while, but now you can at least actually load the file, and extract all variant positions.

The actual task

Variants are found with varying density across the genome. In some cases, it can be relevant to find contiguous regions where the maximum distance of adjacent variants is below some threshold. You are to create a class with at least the following two methods void addPosition(int), and tuple maximumStretch(int islandLength). After all positions have been added, the Python library user should be able to call maximumStretch repeated times with different island lengths, retrieving the first and last position within the longest such island. Within an island, the distance between pos(i) and pos(i+1) should not exceed islandLength. This can be found with a simple for loop across the sequence of positions, but it would be hard to express as a simple series of vectorized operations. Thus, an explicit loop might be required in Python as well, which would be slow - hence our motivation for making a C++ toy library for this purpose.

Update the BOOST_PYTHON_MODULE declaration to expose your new class (or struct). Consult the documentation, search engines, sites like Stackoverflow, and the slides as needed. Test that everything works by getting a reasonable response like this:

import vcf
import ourfancylib
iCounter = ourfancylib.islandCounter()
vcf_reader = vcf.Reader(open('/proj/g2015053/nobackup/chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz', 'r'))
for record in vcf_reader:
    iCounter.addPosition(record.POS)   

print iCounter.maximumStretch(500)

Do a pure Python implementation as well and compare the time (and possibly memory) use for a single and multiple maximumStretch calls with different island lengths.

Extra task 1

Try to get all of this working locally on your own machine. If you can find Boost and Python prebuilt for your distribution/package manager, use them.

Extra task 2

If you have Matlab R2014b or later installed locally, you can try first calling the simple hello_ext extension (you might need to put it in a known Python lib path) with py.hello_ext.greet(), and then test calling your island-finding module, possibly by generating a sequence of randomly spaced markers.

Extra task 3 (mandatory part for day 2)

When this is done, you can consider to exchange more complex types between your C++ library and Python. What if you could simply pass vcf_reader to a new method in islandCounter, that would automatically read all positions? Do you think this would be beneficial for performance? Can you return the full stretch of positions in the stretch as an array or list, rather than just a tuple of the first and last positions?