Ïn this lab, we are going to create a simple Python library in C++. The basic premise is that algorithms requiring tight loops can easily benefit from being run in a compiled language.
The background is that in genome analysis, one frequently focuses on a set of known variant positions. That is, rather than handling the full genome sequence of each individual, we note that the sequence is identical in most positions, and focus on a specific subset where we have high quality data regarding sequence variations.
Results in this form are frequently stored in so-called variant call format (VCF) files. This is a text-based format, with one line per
variant. Today, we will use a Python library to parse vcf files and feed those to our C++ code, on the UPPMAX cluster tintin
.
Login with your UPPMAX user account on tintin. This will require that you have an ssh client on your computer. If you don't, investigate how you install one on your machine. For Windows, you can use e.g. PuTTY. You might also want to consider cygwin, which allows you to run many *nix-like tools and libraries in a compatibility layer that makes Windows behave more like MacOS X or Linux. This is in contrast to true ported versions of e.g. Python. It can sometimes get confusing if you have both cygwin-based versions of libraries and truly native Windows versions. Sometimes, a specific library is not available in a true Windows version, while it is possible to build from source in cygwin.
Anyway, if you are using a command line, connect to the tintin
cluster with the following command:
ssh username@tintin.uppmax.uu.se
You are now connected to the login node. This is a shared resource that should not be used for long-winding computations. You can do ls
, short installations and compilations etc. If you do anything more heavy, you should submit a job or get an interactive shell on a node in the cluster. Getting an interactive shell for one hour can be done like this:
interactive -t 1:00:00 -A g2015053
The two flags specify the time limit and the name of the project allocation we have for this course. You might have to wait for a couple of seconds. Then, you have a normal shell again, but now it is running as a reserved allocation on a specific node.
The tintin cluster is configured with some software right away. If you want to have additional software available in your environment, you can load it. This basically means that the path and library settings are changed to access the new software. In our case, we want to load specific versions of g++, boost, and Python.
module load gcc/4.9.2 module load build-tools module load boost/1.59.0_gcc4.9.2 module load python/2.7.6
You will get a warning for loading boost. While it's inconvenient, it's OK and expected.
We will be using the vcf package and a current version of ipython.
We will also install a specific Python package, PyVCF. This can be done using the pip
tool.
On a machine you control yourself, you can simply run pip install pyvcf
. It will then be downloaded and installed in the common directories and work right away.
In the shared environment of UPPMAX, you can't do this. However, the solution is rather simple. You can put the libraries in your home directory and refer to those versions
instead.
pip install --user pyvcf pip install --user ipython
This will hopefully succeed. Now you can try it in ipython
.
ipython ... import vcf
This might not succeed. If so, get back to the command line and enter the following.
export PATH=$HOME/.local/bin:$PATH export LD_LIBRARY_PATH=$HOME/.local/lib:$LD_LIBRARY_PATH
These two lines will make sure that the directories where pip --user
puts the results are respected.
These settings can be included in the .bash_profile
file (there is also a file called .bashrc
, and we won't
go into details for which one to choose here). You can of course also add the module load
lines above there. Now, we have put the basic pieces together.
We have taken a "hello world" example for Boost Python and adapted the makefile for our setting.
You can clone the path https://github.com/cnettel/Shocksolution_Examples.git
(by using https, authentication is not needed). You might also want to fork this repository to later create your own package within this.
git clone https://github.com/cnettel/Shocksolution_Examples.git cd Shocksolution_Examples/Boost.python/minimal
Look at the different files. As you can see, some paths are hardcoded in the Makefile
. The tool python-config
can report those, but tomorrow we will be looking at using CMake for finding
these paths.
Now, try to make this library.
make
Hopefully, this is successful (no error messages). A file hello_ext.so
is created. Test it.
ipython import hello_ext hello_ext.greet()
Phew, that was quite a mouthful!
We are now going to load the VCF file (from ipython).
import vcf vcf_reader = vcf.Reader(open('/proj/g2015053/nobackup/chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz', 'r')) for record in vcf_reader: print record.POS
If you run this to its natural completion, it will take a while, but now you can at least actually load the file, and extract all variant positions.
Variants are found with varying density across the genome. In some cases, it can be relevant to find contiguous regions where the maximum distance of adjacent variants is below some threshold.
You are to create a class with at least the following two methods void addPosition(int)
, and tuple maximumStretch(int islandLength)
. After all positions have been added,
the Python library user should be able to call maximumStretch
repeated times with different island lengths, retrieving the first and last position within the longest such island. Within an island, the distance
between pos(i) and pos(i+1) should not exceed islandLength. This can be found with a simple for
loop across the sequence of positions, but it would be hard to express as a simple series of vectorized operations.
Thus, an explicit loop might be required in Python as well, which would be slow - hence our motivation for making a C++ toy library for this purpose.
Update the BOOST_PYTHON_MODULE
declaration to expose your new class (or struct). Consult the documentation, search engines, sites like Stackoverflow,
and the slides as needed. Test that everything works by getting a reasonable response like this:
import vcf import ourfancylib iCounter = ourfancylib.islandCounter() vcf_reader = vcf.Reader(open('/proj/g2015053/nobackup/chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.refpanel.ALL.vcf.gz', 'r')) for record in vcf_reader: iCounter.addPosition(record.POS) print iCounter.maximumStretch(500)
Do a pure Python implementation as well and compare the time (and possibly memory) use for a single and multiple maximumStretch
calls with different island lengths.
Try to get all of this working locally on your own machine. If you can find Boost and Python prebuilt for your distribution/package manager, use them.
If you have Matlab R2014b or later installed locally, you can try first calling the simple hello_ext
extension (you might need to put it in a known Python lib path) with py.hello_ext.greet()
,
and then test calling your island-finding module, possibly by generating a sequence of randomly spaced markers.
When this is done, you can consider to exchange more complex types between your C++ library and Python. What if you could simply pass vcf_reader
to a new method in islandCounter
, that would automatically read all positions?
Do you think this would be beneficial for performance? Can you return the full stretch of positions in the stretch as an array or list, rather than just a tuple of the first and last positions?