Ïn this lab, we are going to continue with the Python library from yesterday.
During today's lab, we ask that you create a public github repository for your code if you didn't already, and create branches for the different subtasks you do.
Create a
When you are done, you are supposed to be able to create a subdirectory
Boost Python has an object class that can accept any Python object. Rather than adding individual position, we can accept a VCF reader. This reader is an iterable, we know that
since we can do the
Thankfully, there is a class
Each element in the resulting list will be a VCFRecord, which is also represented by a general Python object. You need to access the "POS" attribute explicitly, i.e.
Add a new method
This performance didn't impress anybody. The actual island length computation algorithm is fast, but the reading is painfully slow. Let's go to using more C++ here. libStatGen contains relatively
nice C++ APIs for several genomic formats. Look at the classes
From a parallelism perspective, one can note that none of these libraries actually decompress on a separate thread from parsing the text. At least in the Python case, the overhead of parsing might be large enough that it would make a bit of sense
to separate these tasks. You can try to see what performance you get with a compressed and uncompressed vcf file stored in CMake
CMakeLists.txt
file based on the slides that is able to generate the library you are creating. The CMakeLists file should
properly pick up the Boost and Python version as needed to build the library file. You might want to browse the net for documentation on the
FindBoost and FindPython CMake modules (as well as the notes in the slides).
build
, run ccmake ..
, generate your makefile, run make,
and then load your library in ipython
again. If you are up for it, you could look into generating config header files from CMake, and make
the actual module name (both the file name and the Python module name) configurable.
Calling PyVCF directly from C++
for
loop in Python. This means that it implements the specific __iter__
function in Python, which returns an iterator implementing relevant
next and get functions. In fact, you can call an arbitrary function on a Python object by writing code like obj("functionname")(arg1,arg2)
. This kind of makes sense, since
Python name binding is very flexible.
stl_input_iterator<T>
. This class can be constructed based on the object and will then start an iteration. For having a matching "end" iterator,
you create an empty iterator of the same type.
record.attr("POS")
.
Remember there is also an extract
template to get a desired C++ type from an arbitrary Python value. You will need to access the "POS" attribute and extract the proper int value.
useReader
based on this approach. It should iterate over the full reader until the end, internally doing what addPosition
did. Test it on a large
enough subset of our test file that you can get an accurate timing result. Is it faster than the more pure Python version?
Extra task - using libStatGen
VcfFileReader
, with its methods open
, readRecord
, and the class VcfRecord
. Try to integrate what you need into your CMake file
as nicely as possible. Time the end result for a vcf.gz file. Add a new interface where the Python code just passes the file name and your library uses libStatGen
behind the scenes. Is this version faster at loading?
/dev/shm
, which is basically a RAM disk. (Reading the uncompressed file from the physical disk will be slower for sure.)