Bayesian debugging

jericson · January 12, 2023, 6:09am

Like the old saying goes: “You can take the programmer out of software development, but you can’t stop them from writing buggy code.” Ok, so that’s not a saying at all. But it does apply to me. Today I wrote a script to collect some data about which schools are most often favorited on College Confidential.

Looking through the data I saw some reassuring things. Schools like UCLA, University of Michigan, Texas A&M and Cornell were near the top of the list of most-favorited. But there were some strange things lsuch as a schools with more favorites than students currently enrolled, blank columns returned and 472 people favoriting The Florida School of Traditional Midwifery.^[1]

At this point, Bayesian thinking ^[2] kicked in. If you asked me before looking at the data how many favorites would The Florida School of Traditional Midwifery have, I’d have put a good deal of money on 0. Not 100% certain, but even one favorite would strike me as a misclick because, no offense to The Florida School of Traditional Midwifery, it’s not a very well-known school.

So there are a few things that might be going on:

The Florida School of Traditional Midwifery is far more popular than I suspected.
There’s something wrong with our data. Perhaps we changed an identifying number such that The Florida School of Traditional Midwifery was getting credited for some other school’s favorites.
I made some mistake in my code.

In my experience, data problems happen on occasion. We recently updated our data based on an outside provider. Mistakes happen. Also in my experience, the bug is usually in my own code. That’s not because I’m a bad programmer, but because code work often means solving a problem I’ve never solved before. Even existing code needs to be fixed from time to time because it’s run into a situation that hadn’t existed (or been noticed) before.

So I did a few things to verify that the data wasn’t causing my problem. I looked at the max and min favorite date for each school. If there was a particular date that reoccurred (especially if it’s a date when we changed something in the data), I’d have a smoking gun. No such luck.

Next I found a student who (according to my code) had expressed interest in The Florida School of Traditional Midwifery. I also found a “Chance Me” thread that this student had started. It mentioned some of the schools she’d^[3] selected, but showed no interest in becoming a midwife. Nothing about the favorite data for this user seemed out of place except for that one obscure school.

So I forced myself to go back to my code and look for mistakes. It only took a few minutes to find a possible problem. The name of the school came from a school profile table. It includes an id field that seemed to match the school_id from the table with favorite data. But then I discovered there was another table of school data that included a profile_id. For some reason (perhaps the order that the tables were initially populated), school_id matches profile_id for fully ⅓ of the rows.

It was just close enough that my initial tests of the query seemed correct, which is why I’d gone with it. If I’d dug a tiny bit deeper, I would have started to notice many more problems. For instance, the The Florida School of Traditional Midwifery profile number matched the George Washington University school identifier. If you asked me how many favorites GWU would get, I suppose a few hundred would be in the range of possibility.

This sort of problem solving is exactly what attracted me to programming in the first place. As frustrating as having technical problems can be, it’s that much rewarding to solve them.

Students of SQL will rightly suspect that NULL is somehow involved. ↩︎
The concept comes from a school of statistical analysis that incorporates new data into an existing (prior) model. Instead of taking new data at face valued, the method evaluates in in the context of other relevant data. Bayes theorem formalizes this idea:

${isplaystyle P(Aid B)={rac {P(Bid A)P(A)}{P(B)}}}$
But we don’t need to use the formula in most cases because we have an intuition of probability based on our own past experiences. So if it really turned out The Florida School of Traditional Midwifery was favorited as often as the data initially suggested, it would dramatically update our assumption about its popularity. The formula just puts that intuition into numbers. ↩︎
I’m making a Bayesian inference from the first name. I’m open to the possibility they don’t use the “she/her” pronouns. ↩︎