Subtle Observations on Range Queries

For my current research, I’ve had to read Kelleher et al.’s excellent msprime paper (2016) on simulating genetic sequences under the coalescent with recombination. One small trick in their algorithm is the use of a Fenwick tree, also called a binary indexed tree. Since I also have a side interest in competitive programming (mainly through USACO and Project Euler), I took a bit more time to learn this data structure.
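As a rough illustration of the data structure named above, here is a minimal Fenwick tree sketch in Python. It is not taken from msprime; the class name `FenwickTree` and its method names are made up for this example. The idea is that both point updates and prefix sums take O(log n) time by walking the lowest set bit of the index.

```python
class FenwickTree:
    """Binary indexed tree over n values: point updates and prefix sums in O(log n).

    Hypothetical helper for illustration only; not the msprime implementation.
    """

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # internal array is 1-indexed

    def add(self, i, delta):
        """Add delta to the value at 0-based index i."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # jump to the next node that covers index i

    def prefix_sum(self, i):
        """Return the sum of the values at 0-based indices 0, ..., i - 1."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total

    def range_sum(self, lo, hi):
        """Return the sum of the values at 0-based indices lo, ..., hi - 1."""
        return self.prefix_sum(hi) - self.prefix_sum(lo)


# Usage: load a few values, then query ranges.
ft = FenwickTree(5)
for idx, value in enumerate([3, 1, 4, 1, 5]):
    ft.add(idx, value)
```

Here `ft.range_sum(1, 4)` adds up the values at indices 1 through 3 without scanning the array.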


Missing Heritability and Microaggressions

Missing heritability is like microaggressions: many seemingly insignificant effects can add up. Two weeks ago, on a journey back to London Heathrow / Oxford, I came across a small connection between a journal article and a podcast. It was a nice moment of seeing two ideas click together. The podcast, which came second, was an episode of Nomad, a British podcast discussing Christian faith outside the institutional church.


Random Graphs and Giant Components

This post will introduce some of the ideas behind random graphs, a very exciting area of current probability research. As has been a theme in my posts so far, I try to emphasize a reproducible, computational example. In this case, we’ll be looking at the “giant component” and how that arises in random graphs. There’s a lot more than this example that I find exciting, so I’ve deferred a longer discussion on random graphs to the end of this post, with a lot of references for the interested reader.
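To give a flavour of the kind of computational example meant here, the following is a minimal sketch, assuming the Erdős–Rényi model G(n, p) with p = c / n. The function name `largest_component_fraction` and its parameters are invented for this illustration and may differ from what the full post does. It samples a graph edge by edge and tracks connected components with a union-find structure.

```python
import random
from collections import Counter


def largest_component_fraction(n, c, seed=0):
    """Sample G(n, p) with p = c / n and return the fraction of vertices
    that lie in the largest connected component.

    Hypothetical helper for illustration; runs in O(n^2) time.
    """
    rng = random.Random(seed)
    p = c / n
    parent = list(range(n))  # union-find forest, one root per component

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Include each of the n * (n - 1) / 2 possible edges independently.
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                root_u, root_v = find(u), find(v)
                if root_u != root_v:
                    parent[root_u] = root_v

    sizes = Counter(find(x) for x in range(n))
    return max(sizes.values()) / n


# Usage: compare a sparse graph (c < 1) with a denser one (c > 1).
below = largest_component_fraction(1000, 0.5)
above = largest_component_fraction(1000, 3.0)
```

For c below 1 the largest component stays a tiny fraction of the graph, while for c above 1 a single component containing a constant fraction of the vertices appears.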


Distributions with SymPy

Any good statistics student will need to do some integrals in her / his life. While I generally feel comfortable with simple integrals, I thought it might be worth setting up a workflow to help automate this process! Previously, especially coming from a physics background, I’ve worked a lot with Mathematica, an advanced version of the software available online as WolframAlpha. Mathematica is extremely powerful, but it’s not open-source and comes with a hefty license, so I decided to research alternatives.
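As a small taste of what such a workflow could look like, here is a minimal sketch that uses SymPy's general-purpose `integrate` on an exponential density. The choice of distribution and the variable names are assumptions for illustration, not necessarily what the full post covers.

```python
import sympy as sp

# Symbols: x is the variable of integration, lam is the rate parameter.
x = sp.symbols("x", real=True)
lam = sp.symbols("lambda", positive=True)

# Density of an exponential distribution on [0, oo).
pdf = lam * sp.exp(-lam * x)

# Check normalization, then compute the first two moments symbolically.
total = sp.integrate(pdf, (x, 0, sp.oo))
mean = sp.integrate(x * pdf, (x, 0, sp.oo))
second_moment = sp.integrate(x**2 * pdf, (x, 0, sp.oo))
variance = sp.simplify(second_moment - mean**2)
```

Declaring `lam` as positive matters: it lets SymPy resolve the convergence conditions on the improper integrals.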


Clustering with K-Means and EM

Introduction: K-means and EM for Gaussian mixtures are two clustering algorithms commonly covered in machine learning courses. In this post, I’ll go through my implementations on some sample data. I won’t be going through much theory, as that can be easily found elsewhere. Instead I’ve focused on highlighting the following: pretty visualizations in ggplot, with the helper packages deldir, ellipse, and knitr for animations; and structural similarities in the algorithms, by splitting up K-means into an E and M step.
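The split of K-means into an E and M step can be sketched as follows. This is a minimal pure-Python illustration, not the ggplot-based implementation from the post; the function names `e_step`, `m_step`, and `kmeans` are made up for this example. The E step makes hard assignments to the nearest center, and the M step moves each center to the mean of its assigned points.

```python
import random


def e_step(points, centers):
    """Assign each point the index of its nearest center (hard assignment)."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    return [
        min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
        for p in points
    ]


def m_step(points, labels, centers):
    """Move each center to the mean of its points; empty clusters stay put."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, z in zip(points, labels) if z == k]
        if members:
            new_centers.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        else:
            new_centers.append(old_center)
    return new_centers


def kmeans(points, k, iters=100, seed=0):
    """Alternate E and M steps until the assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize at k distinct data points
    labels = None
    for _ in range(iters):
        new_labels = e_step(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = m_step(points, labels, centers)
    return centers, labels


# Usage: two well-separated groups of points in the plane.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
```

EM for a Gaussian mixture has the same loop shape, with the hard assignment replaced by posterior probabilities and the plain means replaced by weighted estimates.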


Statistics / ML Books

At the start of the last post, I talked briefly about courses I’ve been working through. Here are some follow-up thoughts on good books! This post will focus on textbooks with a machine learning focus. I’ve read fewer of the classic statistics textbooks, as I hadn’t specialized much in statistics until my PhD. However, these are a few texts that are on my radar to consult: The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2009, 2nd Ed.


Polynomial Regression

Introduction: side courses. As a PhD student in the UK system, I was expecting a lot less coursework, with my first year diving straight into research. However, there are still a lot of gaps in my knowledge, so I hope to always be on the lookout for learning opportunities, including side classes. At the moment, I’m hoping to follow along with these three courses and do some assignments from time to time:



Brian Zhang


Statistics PhD Student

University of Oxford