Merges and Joins: From SQL to Stata: A Visual Guide That Actually Makes Sense
🔗 Merges and Joins: From SQL to Stata by Florian Oswald · JPE Data Editor
Florian Oswald is the data editor for the Journal of Political Economy. His job is to figure out why published results do not replicate. This guide to merges and joins exists because one of the most common culprits is a single Stata command: merge m:m.
The guide covers inner, left, right, and full joins across SQL, R (dplyr), Julia (DataFrames.jl), and Stata; all with clean diagrams and executable R code. But the real value is in the Stata-specific material starting at Section 5.
The merge m:m Problem
Stata’s merge m:m does not produce a relational join. It performs positional row-matching within key groups. The first row in the master pairs with the first row in the using file, second with second, and so on.
The output depends on sort order. Re-sort either dataset before the merge and you get different employee-contract pairs, different student-class assignments, different anything-paired-with-anything. The _merge == 3 count stays the same, so you get false confidence that the merge worked.
Oswald demonstrates this with a concrete example and a self-contained Stata script you can run yourself. Employees matched to contracts by position, then re-sorted and matched differently. Same code, same data, different results. Two researchers running the same code with different sort histories get different answers. The code is logically wrong even when results happen to replicate.
The Fix
Use joinby for a true Cartesian product within key groups, or better yet, find the actual unique key and use merge 1:1. The guide walks through both options with full Stata code.
The Rest
Sections 1–4 are adapted from R for Data Science (2nd ed) and provide a clean, visual introduction to joins built around the nycflights13 dataset. Keys, cardinality, filtering joins, semi-joins, anti-joins; all with diagrams that show exactly what happens with zero, one, and multiple matches.
The quick reference table at the end maps SQL, Stata, R, and Julia syntax for every join type. Worth bookmarking.
Why It Matters
Reproducibility crises in economics and social science often trace back to data processing errors, not analysis errors. A merge m:m that behaves differently depending on invisible sort order is exactly the kind of bug that passes peer review. Oswald’s guide is a public service.
If you use Stata, read Section 6. If you use any other tool and collaborate with Stata users, send them the link.
Crepi il lupo! 🐺