To observe interesting structure in data, we often need to throw away quite a lot of information! This talk centers on a small Python library I have developed and on some applications that use it to help find structure in large DNA datasets, mainly as part of the quality control strategies used in a sequencing lab.
I briefly introduce the idea of discarding information from large datasets in order to find useful structure in them, together with some background on the DNA sequence data I work with. One or two case studies are presented. I also cover some of the Python coding techniques and resources I have used, such as iterator algebra, the multiprocessing module, and caching results via pickling. A functional programming approach keeps the kernel library small: clients of that library define their own functions for domain-specific tasks such as file parsing and pass these functions to the kernel.
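To illustrate the kernel-plus-client split described above, here is a minimal sketch. It is not the library's actual code: the names `summarise`, `cached`, `parse_fasta` and `first_base` are hypothetical, and the FASTA-style input is only an assumed example format. The kernel knows nothing about file formats; the client passes in a parser and a reducing function that discards most of each record. The multiprocessing step is left out to keep the sketch short.

```python
import itertools
import os
import pickle
from collections import Counter


def summarise(lines, parse, reduce_record, limit=None):
    """Hypothetical kernel: format-agnostic, driven by client functions."""
    records = parse(lines)                       # client-supplied parser (a generator)
    records = itertools.islice(records, limit)   # iterator algebra: lazily cap the stream
    return Counter(map(reduce_record, records))  # keep only the reduced value per record


def cached(path, compute):
    """Load a pickled result from path if it exists, else compute and store it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


# --- client side: domain-specific functions (assumed FASTA-like input) ---

def parse_fasta(lines):
    """Yield one sequence string per record; header lines start with '>'."""
    seq = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq:
                yield "".join(seq)
            seq = []
        elif line:
            seq.append(line)
    if seq:
        yield "".join(seq)


def first_base(seq):
    """Throw away everything except the first base of the sequence."""
    return seq[0]


fasta = [">r1", "ACGT", ">r2", "AAGT", "CC", ">r3", "GGTA"]
counts = summarise(fasta, parse_fasta, first_base)
print(counts)
```

Because the parser is an ordinary generator function, supporting another file format would only require a new client function; the kernel itself would stay unchanged. A call such as `cached("counts.pkl", lambda: summarise(fasta, parse_fasta, first_base))` would then reuse the pickled result on later runs.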