Contributing to the Tidyverse (dbplyr)

Illustrations by Allison Horst, artist in residence at RStudio

I was acknowledged as a contributor to the version 2.0.0 release of dbplyr!

dbplyr is the database backend for dplyr (‘data pliers’), the data manipulation package in the tidyverse software suite for the R statistical programming language.

Or, to describe it from the top down:

  • R is a computer programming language used by statisticians and others who want to interpret data.

  • tidyverse is a collection of software packages for the R language that makes it easier for R users to manipulate and process data. So much easier that R is now taught to liberal arts post-graduate students to analyze data, e.g. for environmental studies at Harvard Extension School. These students often have no prior experience in computer programming.

    The tidyverse was largely the creation of a New Zealander, Hadley Wickham, and it looks like he is the chief maintainer of the tidyverse software. Like R, tidyverse is ‘open source’, freely available for use and modification, and contributed to by many enthusiasts in the data science community.

  • dplyr is a software package in the tidyverse collection which does many of the common data manipulation tasks, such as filtering, changing, sorting, summarizing and selecting.

  • dbplyr allows dplyr to interact with database backends.

My contributions to the free and open-source dbplyr are (ironically) related to how dbplyr operates with Microsoft SQL Server (‘MSSQL’). To Microsoft’s credit, the basic versions of Microsoft SQL Server are freely available, as are client libraries (for use in Linux), and Microsoft also provides extensive freely available documentation.

As of 21st December 2020, my two accepted contributions (‘pull requests’) are:

  1. Cast as.double and as.numeric to FLOAT instead of NUMERIC

    In MSSQL, NUMERIC with no precision or scale specified has a scale of 0, so floating-point numbers are rounded to integers, which is not what is intended for as.double and as.numeric in R.

  2. Use try_cast instead of cast for MSSQL version 11+ (2012+)

    In MSSQL, try_cast allows more elegant handling of invalid entries. try_cast returns NULL (shown as NA, ‘not available’, in R) in situations where cast would return an error.
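
A minimal sketch of how the effect of these two changes could be inspected, assuming a version of dbplyr (2.0.0 or later) that includes them. It uses dbplyr's simulate_mssql() and translate_sql(), so no real SQL Server instance is needed. The version strings and the column name x are illustrative choices only, not taken from the pull requests.

```r
library(dbplyr)

# Simulated connections: nothing is sent to a real database.
# "15.0" stands in for a recent server (SQL Server 2019),
# "10.0" for one older than version 11 (SQL Server 2012).
mssql_new <- simulate_mssql(version = "15.0")
mssql_old <- simulate_mssql(version = "10.0")

# Ask dbplyr which SQL it would generate for as.numeric() on a
# column named x (x is a hypothetical column, never evaluated in R).
sql_new <- translate_sql(as.numeric(x), con = mssql_new)
sql_old <- translate_sql(as.numeric(x), con = mssql_old)

print(sql_new)
print(sql_old)
```

If the changes behave as described above, the first translation should use TRY_CAST with a FLOAT target, and the second should fall back to a plain CAST, still to FLOAT.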

As of 21st December 2020, I also have a currently open contribution (‘pull request’) to fix an error in my second contribution.

What I really would like to say is just how friendly Hadley Wickham and others have been in helping me contribute to and improve dbplyr.

Both in initial discussion and in the process of doing a ‘pull request’, Hadley and Kirill Müller have answered the simplest of queries, amended my super-clumsy code and really encouraged me along! Hadley is an adjunct professor and something of a data science legend. I have not attended a formal computer programming class at high school, university or trade school, so I’m really humbled to feel like a valued contributor to the data science world.

(And why am I so interested in improving the operation of dbplyr with MSSQL? It is because I use dbplyr/dplyr to interrogate the Best Practice electronic medical record patient information database with my ‘near future’ patient care quality improvement tool GPstat!.)

David Fong
Lead doctor, Kensington site, coHealth

My interests include sustainable development in low-resource populations, teaching and the uses of monitoring and evaluation in clinical practice.