Data Science

Selecting columns based on type

The tidyverse, and dplyr in particular, provides functions to select columns from a data frame. There are three scoped functions available: select_all, select_if and select_at. In this post, we’ll look at a particular application of select_if, namely capturing the names of numeric variables. A quick Google search turns up a few solutions to this problem. As an example data set, I’ll use the diamonds data set from the ggplot2 package.
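As a rough illustration of the idea, and not necessarily the approach the full post settles on, the names of the numeric columns in diamonds can be captured by passing is.numeric as the predicate to select_if. The variable name numeric_vars is just a placeholder chosen for this sketch.

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds data set

# Keep the columns for which is.numeric() returns TRUE,
# then pull out their names as a character vector.
numeric_vars <- diamonds %>%
  select_if(is.numeric) %>%
  names()

numeric_vars
# "carat" "depth" "table" "price" "x" "y" "z"
```

The ordered factors cut, color and clarity are dropped because is.numeric returns FALSE for factors. In dplyr 1.0.0 and later, select(where(is.numeric)) is the more current way to write the same selection.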


Joint Statistical Meetings Talk

The Joint Statistical Meetings are being held in the beautiful city of Vancouver, British Columbia this year. I gave a talk on data visualization, which is a new topic for me, but an area I’m quite excited about. I’ve been looking into the newer toolsets using JavaScript graphics for a while now for different projects, and this talk gave me a chance to coalesce my thoughts about data visualization principles and usage, and how the new toolsets can help.

Quirks about running Rcpp on Windows through RStudio

This is a quick note about some tribulations I had running Rcpp (v. 0.12.12) code through RStudio (v. 1.0.143) on a Windows 7 box running R (v. 3.3.2). I also have RTools v. 3.4 installed. I fully admit that this may very well be specific to my box, but I suspect not. I kept running into problems with Rcpp complaining that (a) RTools wasn’t installed, and (b) the C++ compiler couldn’t find Rcpp.

Some thoughts on the downsides of current Data Science practice

Bert Huang has a nice blog post about the poor results of ML/AI algorithms on “wild” data, which echoes some of my experience and thoughts. His conclusions are worth thinking about, IMO. 1. Big data is complex data. As we go out and collect more data from a finite world, we’re necessarily going to start collecting more and more interdependent data. Back when we had hundreds of people in our databases, it was plausible that none of our data examples were socially connected.

Annotated Facets with ggplot2

I was recently asked to do a panel of grouped boxplots of a continuous variable, with each panel representing a categorical grouping variable. This seems easy enough with ggplot2 and the facet_wrap function, but then my collaborator wanted p-values on the graphs! This post is my approach to the problem. First of all, one caveat. I’m a huge fan of Hadley Wickham’s tidyverse and so most of my code will reflect this ethos, including packages and pipes.

A follow-up to Crowdsourcing Research

Last month I published some thoughts on crowdsourcing research, inspired by Anthony Goldbloom’s talk at Statistical Programming DC on the Kaggle experience. Today, I found a rather similar discussion on crowdsourcing research (on the online version of the magazine Good) as a potential way to increase the accuracy of scientific research and reduce bias. I think academia, funding agencies, journals and consumers of scientific and technological research all need to give more consideration to breaking silos and making progress accurate and reproducible, and to finding new ways of preserving the profit imperative in technological progress that still allow for the sharing and crowdsourcing of knowledge and research progress.

Crowdsourcing research

Last evening, Anthony Goldbloom, the founder of Kaggle, gave a very nice talk at a joint Statistical Programming DC/Data Science DC event about the Kaggle experience and what can be learned from the results of their competitions. One of the takeaway messages was that crowdsourcing data problems to a diligent and motivated group of entrepreneurial data scientists can get you to the threshold of extracting signal and patterns from data far more quickly than if a closed and siloed group of analysts worked on the problem.

Reading fixed width formats in the Hadleyverse

This is an update to a previous post on reading fixed width formats in R. A new addition to the Hadleyverse is the package readr, which includes a function read_fwf to read fixed width format files. I’ll compare the LaF approach to the readr approach using the same dataset as before. The variable wt is generated from parsing the Stata load file as before. I want to read all the data in two columns: DRG and HOSPID.
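A minimal sketch of the readr side of that comparison might look like the following. The file contents, the field widths in wt and the third field are all invented for illustration; in the post, wt comes from parsing the Stata load file, which is not reproduced here.

```r
library(readr)

# Hypothetical fixed-width file with three fields per record:
# DRG (3 characters), HOSPID (5 characters), and a 4-character
# field we do not want to read.
tmp <- tempfile(fileext = ".txt")
writeLines(c("039100011234",
             "470100029876"), tmp)

# Assumed field widths, standing in for the parsed wt vector.
wt <- c(3, 5, 4)
ends <- cumsum(wt)
starts <- ends - wt + 1

# Read only the first two fields, as character columns.
keep <- c(1, 2)
dat <- read_fwf(
  tmp,
  col_positions = fwf_positions(starts[keep], ends[keep],
                                col_names = c("DRG", "HOSPID")),
  col_types = "cc"
)

dat
```

Converting widths to start and end positions makes it easy to skip fields: only the positions passed to fwf_positions are parsed, so the unwanted columns are never read into memory.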