Data Science

My research is quantitative and relies primarily on large-scale archival panel data. I am currently tracing the trajectory of basic life science ideas from hundreds of thousands of academic journal publications to thousands of real ventures.

The core data components I work with include a parsed version of the PubMed XML and the NIH ExPORTER. I am also looking into updating the existing author disambiguation for PubMed (Author-ity) using a novel training set of thousands of externally verified author IDs. Connecting these data allows the analysis of over 70 million authorships on about 25 million scientific articles, as well as the identification of many real ventures that can be traced back to basic science ideas (bench-to-bedside).

Below is a set of links to code and data sources that others may find helpful in developing their work on scientific innovation and the sociology of science. Please feel free to reach out with any questions.

Probabilistic gender designation: R code for calling the Genderize database

Linking NIH R01 PIs to PubMed Author IDs

Replication data for Circulation Research Letter (2018)

Replication data for BMJ Research Article (2019)
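
For readers curious what the probabilistic gender designation listed above involves, here is a minimal Python sketch. It is not the linked R code. It assumes the public Genderize endpoint at https://api.genderize.io, which accepts repeated `name[]` parameters and returns records with `name`, `gender`, `probability`, and `count` fields. The function names and the 0.9 cutoff are hypothetical choices made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.genderize.io"  # public Genderize endpoint (assumed reachable)


def build_query(names):
    """Build a batch request URL with one name[] parameter per first name."""
    return API_URL + "?" + urlencode([("name[]", n) for n in names])


def designate(record, min_probability=0.9):
    """Return 'male', 'female', or 'unknown' for one response record.

    min_probability is an illustrative cutoff, not a recommended value.
    """
    gender = record.get("gender")
    probability = record.get("probability") or 0.0
    if gender is None or probability < min_probability:
        return "unknown"
    return gender


def fetch(names):
    """Query the API and return the parsed list of records (needs network access)."""
    with urlopen(build_query(names)) as response:
        return json.load(response)
```

Calling `designate` on each record returned by `fetch(["peter", "lois"])` would yield one label per name. Names that the database cannot resolve, or that fall below the cutoff, are kept as "unknown" rather than forced into a category.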