Brian Zhang's blog

Statistics and other topics

Recent posts

Feb 4, 2020 · 1 min read
On NumPy Multithreading
Two notes. First, NumPy supports multithreading, and this can give you a speed boost in multicore environments! On Linux, I used top to confirm that my NumPy installation was indeed using multiple threads. Second, multithreading can hurt performance when you’re running multiple Python / NumPy processes at once. I ran into this issue and got a significant boost by limiting the number of NumPy threads per process, in my case using import mkl; mkl.
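As a rough illustration of capping threads per process, here is a minimal sketch that uses environment variables rather than the mkl module mentioned in the excerpt. It assumes the numerical backend behind NumPy honors the commonly used variables OMP_NUM_THREADS, MKL_NUM_THREADS, and OPENBLAS_NUM_THREADS; the helper name limit_numpy_threads is hypothetical.

```python
import os

# Assumption: these variables are read by the threading runtimes behind common
# NumPy builds (OpenMP, Intel MKL, OpenBLAS). They generally need to be set
# before NumPy is first imported in the process.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_numpy_threads(n_threads=1):
    """Request at most n_threads per process from the numerical backends."""
    for var in THREAD_ENV_VARS:
        os.environ[var] = str(n_threads)


# Hypothetical helper: call it at the very top of each worker script.
limit_numpy_threads(1)

# Import NumPy only after the limits are in place, for example:
# import numpy as np
```

Whether this helps depends on the workload: with many processes already saturating the cores, one thread per process is a reasonable first setting to try.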
Oct 25, 2019 · 1 min read
Fast Hierarchical Clustering Using fastcluster
Do you use hierarchical clustering packages like R’s hclust or Python’s scipy.cluster.hierarchy.linkage in your workflow? If so, you’re using an \(O(N^3)\) algorithm and should switch to the fastcluster package, which provides \(O(N^2)\) routines for the most commonly used types of clustering. fastcluster is implemented in C++, with interfaces for C++, R, and Python. In particular, the Python interface mirrors scipy.cluster.hierarchy.linkage, and the R interface mirrors stats::hclust and flashClust::flashClust, so switching over is a no-brainer.
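Because the Python interface mirrors scipy.cluster.hierarchy.linkage, the switch can be as small as changing an import. The sketch below is illustrative only: the data are random, the choice of average linkage is arbitrary, and the fallback to SciPy is my own addition so that the snippet still runs when fastcluster is not installed.

```python
import numpy as np

try:
    # Preferred: fastcluster's linkage, which takes the same core arguments.
    import fastcluster
    linkage = fastcluster.linkage
except ImportError:
    # Fallback (an assumption for this sketch): SciPy's implementation.
    from scipy.cluster.hierarchy import linkage

# Made-up data: 50 points in the plane.
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))

# Same call signature either way: observations, linkage method, metric.
Z = linkage(points, method="average", metric="euclidean")

# Z is the usual (N - 1) x 4 linkage matrix, usable with SciPy's
# dendrogram and fcluster helpers.
print(Z.shape)
```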
Sep 28, 2019 · 4 min read
Software Engineering Tools Across 4 Languages
It’s been two years since I started blogging, as well as two years since I started my PhD in the Oxford Statistics department. While it’s been several months since my last post, I hope to get back into sharing some shorter posts and ideas going forward. To get myself writing again, I thought I’d broaden the scope of my posts beyond statistics and machine learning. In particular, over the past two years, I’ve found myself getting more interested in abstract math as well as software engineering.
Jan 22, 2019 · 9 min read
Subtle Observations on Range Queries
For my current research, I’ve had to read Kelleher et al.’s excellent msprime paper (2016) on simulating genetic sequences under the coalescent with recombination. One small trick their algorithm uses is a data structure called a Fenwick tree, or binary indexed tree. Since I also have a side interest in competitive programming (mainly through USACO and Project Euler), I took a bit more time to learn this data structure.
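For readers who have not met the data structure, here is a minimal sketch of a Fenwick tree supporting point updates and range-sum queries in \(O(\log N)\) time each. It is a textbook version written for illustration, not the implementation used in msprime; the class and method names are my own.

```python
class FenwickTree:
    """Binary indexed tree over a fixed-size array of numbers."""

    def __init__(self, size):
        self.size = size
        # Internal storage is 1-indexed; slot 0 is unused.
        self.tree = [0] * (size + 1)

    def add(self, index, delta):
        """Add delta to the element at 0-based position index."""
        i = index + 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & -i  # move to the next node covering this position

    def prefix_sum(self, count):
        """Return the sum of the first count elements."""
        total = 0
        i = count
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

    def range_sum(self, lo, hi):
        """Return the sum of elements at 0-based positions lo..hi-1."""
        return self.prefix_sum(hi) - self.prefix_sum(lo)


values = [3, 1, 4, 1, 5, 9]
fenwick = FenwickTree(len(values))
for position, value in enumerate(values):
    fenwick.add(position, value)

before = fenwick.range_sum(1, 4)  # 1 + 4 + 1
fenwick.add(2, 10)                # the element at position 2 becomes 14
after = fenwick.range_sum(1, 4)   # 1 + 14 + 1
```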
Oct 24, 2018 · 8 min read
Missing Heritability and Microaggressions
Missing heritability is like microaggressions: many seemingly insignificant effects can add up. Two weeks ago, as I was taking a journey back to London Heathrow / Oxford, I came across a small connection in a journal article and a podcast. It was a nice moment of seeing two ideas click together. The podcast, which came second, was an episode of Nomad, a British podcast discussing Christian faith outside the institutional church.
Jul 10, 2018 · 14 min read
Random Graphs and Giant Components
This post will introduce some of the ideas behind random graphs, a very exciting area of current probability research. As has been a theme in my posts so far, I try to emphasize a reproducible, computational example. In this case, we’ll be looking at the “giant component” and how that arises in random graphs. There’s a lot more than this example that I find exciting, so I’ve deferred a longer discussion on random graphs to the end of this post, with a lot of references for the interested reader.
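To give a flavor of the computational example, here is a small sketch of the giant component appearing. It assumes the Erdős-Rényi model \(G(n, p)\) with \(p = c / n\), which the excerpt does not name explicitly, and tracks connected components with a union-find structure; the function name and parameter choices are illustrative.

```python
import random
from collections import Counter


def largest_component_fraction(n, c, seed=0):
    """Sample G(n, p) with p = c / n; return the largest component's share of nodes."""
    rng = random.Random(seed)
    p = c / n
    parent = list(range(n))  # union-find forest, one tree per component

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Include each possible edge independently with probability p.
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                root_u, root_v = find(u), find(v)
                if root_u != root_v:
                    parent[root_u] = root_v

    sizes = Counter(find(x) for x in range(n))
    return max(sizes.values()) / n


# Below the threshold c = 1 the largest component stays small;
# above it, a single component takes up a constant fraction of the graph.
subcritical = largest_component_fraction(1000, 0.5)
supercritical = largest_component_fraction(1000, 2.0)
print(subcritical, supercritical)
```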
Apr 4, 2018 · 8 min read
Distributions with SymPy
Any good statistics student will need to do some integrals in her / his life. While I generally feel comfortable with simple integrals, I thought it might be worth setting up a workflow to help automate this process! Previously, especially coming from a physics background, I’ve worked a lot with Mathematica, an advanced version of the software available online as WolframAlpha. Mathematica is extremely powerful, but it’s not open-source and comes with a hefty license, so I decided to research alternatives.
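As a taste of what such a workflow can look like, here is a minimal sketch that uses SymPy to check the normalization, mean, and variance of an exponential distribution. The choice of distribution and the variable names are my own, for illustration only.

```python
import sympy as sp

# Declaring sign assumptions lets SymPy return closed forms
# instead of conditional (Piecewise) results.
x = sp.symbols("x", nonnegative=True)
lam = sp.symbols("lambda", positive=True)

# Density of an exponential distribution with rate lambda.
pdf = lam * sp.exp(-lam * x)

# Total probability, first moment, and variance via symbolic integration.
total = sp.integrate(pdf, (x, 0, sp.oo))
mean = sp.integrate(x * pdf, (x, 0, sp.oo))
second_moment = sp.integrate(x**2 * pdf, (x, 0, sp.oo))
variance = sp.simplify(second_moment - mean**2)

print(total, mean, variance)
```

SymPy also ships a sympy.stats module with ready-made distributions, which may be more convenient once the integrals get repetitive.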
Jan 30, 2018 · 10 min read
Clustering with K-Means and EM
Introduction
K-means and EM for Gaussian mixtures are two clustering algorithms commonly covered in machine learning courses. In this post, I’ll go through my implementations on some sample data. I won’t be going through much theory, as that can be easily found elsewhere. Instead I’ve focused on highlighting the following:
Pretty visualizations in ggplot, with the helper packages deldir, ellipse, and knitr for animations.
Structural similarities in the algorithms, by splitting up K-means into an E and M step.
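The split of K-means into an E and M step can be sketched as follows. This is a minimal NumPy version written for illustration, not the implementation from the post; the function names, the toy data, and the initial centers are all assumptions.

```python
import numpy as np


def e_step(points, centers):
    """Assign each point to its nearest center (hard assignment)."""
    squared_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return squared_dists.argmin(axis=1)


def m_step(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = points[labels == k]
        if len(members) > 0:  # leave an empty cluster's center where it is
            new_centers[k] = members.mean(axis=0)
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Alternate E and M steps until the centers stop moving."""
    for _ in range(max_iter):
        labels = e_step(points, centers)
        new_centers = m_step(points, labels, centers)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


# Toy data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
points = np.concatenate([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
initial_centers = points[[0, 50]].copy()
centers, labels = kmeans(points, initial_centers)
```

EM for a Gaussian mixture has the same shape: the E step computes soft responsibilities instead of hard labels, and the M step updates weighted means, covariances, and mixing proportions.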
Nov 25, 2017 · 4 min read
Statistics / ML Books
At the start of the last post, I talked briefly about courses I’ve been working through. Here are some follow-up thoughts on good books! This post will focus on textbooks with a machine learning focus. I’ve read less of the classic statistics textbooks, as I hadn’t specialized much in statistics until my PhD. However, these are a few texts that are on my radar to consult: The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009, 2nd Ed.
Nov 9, 2017 · 8 min read
Polynomial Regression
Introduction: side courses
As a PhD student in the UK system, I was expecting a lot less coursework, with my first year diving straight into research. However, there are still a lot of gaps in my knowledge, so I hope to always be on the lookout for learning opportunities, including side classes. At the moment, I’m hoping to follow along with these three courses and do some assignments from time to time: