On NumPy Multithreading —
Two notes. First, NumPy supports multithreading, and this can give you a speed boost in multicore environments! On Linux, I used top to verify that my NumPy was indeed using multiple threads, which it was. Second, multithreading can hurt performance when you’re running multiple Python / NumPy processes at once. I was running into this issue, and got a significant boost by limiting the number of NumPy threads per process, in my case using import mkl; mkl.
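As a rough illustration of capping threads per process, here is a minimal sketch that uses environment variables rather than the mkl call mentioned above. It assumes NumPy is linked against a backend that reads one of these variables (OpenMP, MKL, or OpenBLAS); the helper name limit_numpy_threads is hypothetical, and the variables must be set before NumPy is first imported.

```python
import os

# Environment variables read by common BLAS / OpenMP backends when they load.
# Which of these actually matters depends on how NumPy was built (an assumption).
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_numpy_threads(n_threads):
    """Cap the per-process thread count; call this before importing numpy."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    for var in THREAD_ENV_VARS:
        os.environ[var] = str(n_threads)
    # Return the values that were set, for logging or checking.
    return {var: os.environ[var] for var in THREAD_ENV_VARS}


settings = limit_numpy_threads(1)
# import numpy as np  # import only after the variables above are set
```

When several worker processes run side by side, giving each one a single thread (or cores divided by workers) avoids oversubscribing the machine.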

Fast Hierarchical Clustering Using fastcluster —
Do you use hierarchical clustering packages like R’s hclust or Python’s scipy.cluster.hierarchy.linkage in your workflow? If so, you’re using an \(O(N^3)\) algorithm and should switch to the fastcluster package, which provides \(O(N^2)\) routines for the most commonly used types of clustering.
fastcluster is implemented in C++, with interfaces for C++, R, and Python. In particular, the Python interface mirrors scipy.cluster.hierarchy.linkage, and the R interface mirrors stats::hclust and flashClust::flashClust, so switching over is a no-brainer.

Software Engineering Tools Across 4 Languages —
It’s been two years since I started blogging, as well as two years since I started my PhD in the Oxford Statistics department. While it’s been several months since my last post, I hope to get back into sharing some shorter posts and ideas going forward.
To get myself writing again, I thought I’d broaden the scope of my posts beyond statistics and machine learning. In particular, over the past two years, I’ve found myself getting more interested in abstract math as well as software engineering.

Subtle Observations on Range Queries —
For my current research, I’ve had to read Kelleher et al.’s excellent msprime paper (2016) for simulating genetic sequences under the coalescent with recombination. One small trick their algorithm uses is a Fenwick tree, also known as a binary indexed tree. Since I also have a side interest in competitive programming (mainly through USACO and Project Euler), I took a bit more time to learn this data structure.
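For readers who haven't met the data structure, here is a minimal sketch of a Fenwick tree supporting point updates and prefix sums in \(O(\log N)\) each. It is a generic textbook version, not the implementation used in msprime; the class and method names are my own.

```python
class FenwickTree:
    """Binary indexed tree over `size` numbers, all initially zero."""

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)  # internal storage is 1-indexed

    def update(self, index, delta):
        """Add `delta` to the element at 0-based position `index`."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & -i  # jump to the next node that covers this position

    def prefix_sum(self, count):
        """Return the sum of the first `count` elements."""
        total = 0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

    def range_sum(self, lo, hi):
        """Return the sum of elements at 0-based positions lo, ..., hi - 1."""
        return self.prefix_sum(hi) - self.prefix_sum(lo)


values = [3, 1, 4, 1, 5, 9, 2, 6]
ft = FenwickTree(len(values))
for position, value in enumerate(values):
    ft.update(position, value)

first_four = ft.prefix_sum(4)  # 3 + 1 + 4 + 1
middle = ft.range_sum(2, 6)    # 4 + 1 + 5 + 9
```

The key step is `i & -i`, which isolates the lowest set bit of `i` and determines how many elements each node is responsible for.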

Missing Heritability and Microaggressions —
Missing heritability is like microaggressions: many seemingly insignificant effects can add up.
Two weeks ago, as I was travelling back to London Heathrow / Oxford, I came across a small connection between a journal article and a podcast. It was a nice moment of seeing two ideas click together.
The podcast, which came second, was an episode of Nomad, a British podcast discussing Christian faith outside the institutional church.

Random Graphs and Giant Components —
This post will introduce some of the ideas behind random graphs, a very exciting area of current probability research. As has been a theme in my posts so far, I try to emphasize a reproducible, computational example. In this case, we’ll be looking at the “giant component” and how that arises in random graphs.
There’s a lot more than this example that I find exciting, so I’ve deferred a longer discussion on random graphs to the end of this post, with a lot of references for the interested reader.
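To give a flavour of the computational example, here is a minimal sketch that estimates the fraction of vertices in the largest connected component. It assumes the Erdős–Rényi model \(G(n, p)\) with \(p = c/n\) (the model is my assumption, as are the function name and parameters) and uses a union-find structure to track components.

```python
import random
from collections import Counter


def largest_component_fraction(n, c, seed=0):
    """Sample G(n, p) with p = c / n; return |largest component| / n."""
    rng = random.Random(seed)
    p = c / n
    parent = list(range(n))  # union-find forest: each vertex starts alone

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # include edge {u, v} with probability p
                root_u, root_v = find(u), find(v)
                if root_u != root_v:
                    parent[root_u] = root_v

    sizes = Counter(find(x) for x in range(n))
    return max(sizes.values()) / n


subcritical = largest_component_fraction(1000, 0.5)    # mean degree below 1
supercritical = largest_component_fraction(1000, 2.0)  # mean degree above 1
```

Below the threshold \(c = 1\), the largest component should cover only a small fraction of the vertices, while above it a single component should contain a constant fraction of them.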

Distributions with SymPy —
Any good statistics student will need to do some integrals in her / his life. While I generally feel comfortable with simple integrals, I thought it might be worth setting up a workflow to help automate this process!
Previously, especially coming from a physics background, I’ve worked a lot with Mathematica, an advanced version of the software available online as WolframAlpha. Mathematica is extremely powerful, but it’s not open-source and comes with a hefty license, so I decided to research alternatives.

Clustering with K-Means and EM —
Introduction
K-means and EM for Gaussian mixtures are two clustering algorithms commonly covered in machine learning courses. In this post, I’ll go through my implementations on some sample data.
I won’t be going through much theory, as that can be easily found elsewhere. Instead I’ve focused on highlighting the following:
Pretty visualizations in ggplot, with the helper packages deldir, ellipse, and knitr for animations.
Structural similarities in the algorithms, by splitting up K-means into an E and M step.
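The E / M split of K-means mentioned above can be sketched as follows. This is a plain-Python illustration under my own assumptions (function names, synthetic two-blob data, initial centers taken from the data), not the R implementation from the post.

```python
import random


def e_step(points, centers):
    """E step: hard-assign each point to the index of its nearest center."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
        for p in points
    ]


def m_step(points, labels, centers):
    """M step: move each center to the mean of the points assigned to it."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            new_centers.append(old_center)  # leave an empty cluster in place
            continue
        dim = len(old_center)
        new_centers.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
        )
    return new_centers


def k_means(points, centers, max_iter=100):
    """Alternate M and E steps until the assignments stop changing."""
    labels = e_step(points, centers)
    for _ in range(max_iter):
        centers = m_step(points, labels, centers)
        new_labels = e_step(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels


# Synthetic data: two well-separated 2-D Gaussian blobs (illustrative only).
rng = random.Random(0)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
points += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
centers, labels = k_means(points, [points[0], points[-1]])
```

EM for a Gaussian mixture has the same alternating shape: the E step replaces hard labels with soft responsibilities, and the M step updates weighted means, covariances, and mixing weights.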

Statistics / ML Books —
At the start of the last post, I talked briefly about courses I’ve been working through. Here are some follow-up thoughts on good books!
This post will focus on textbooks with a machine learning focus. I’ve read fewer of the classic statistics textbooks, as I hadn’t specialized much in statistics until my PhD. However, these are a few texts that are on my radar to consult:
The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009, 2nd Ed.

Polynomial Regression —
Introduction: side courses
As a PhD student in the UK system, I was expecting a lot less coursework, with my first year diving straight into research. However, there are still a lot of gaps in my knowledge, so I hope to always be on the lookout for learning opportunities, including side classes.
At the moment, I’m hoping to follow along with these three courses and do some assignments from time to time: