Why you should blog if you are a data scientist

A while back, David Robinson offered the following advice on Twitter:

When you’ve written the same code 3 times, write a function

When you’ve given the same in-person advice 3 times, write a blog post
— David Robinson (@drob) November 9, 2017

He also followed it up with an excellent blog post aimed at aspiring data scientists. I think that the most important idea from that post is actually from a presentation he gave. Here’s the key idea, as reported by Amelia McNamara:

"Things that are still on your computer are approximately useless." -@drob #eUSR #eUSR2017 pic.twitter.com/nS3IBiRHBn
— Amelia McNamara (@AmeliaMN) November 3, 2017

In summary, things that you keep to yourself have very little value, and things that you share with the world have a lot of value. I think that’s a very cool idea for the Economics of information. For an extreme case, imagine you have a trade secret that allows you to solve a specific class of problems faster than others. In this case, it’s understandable that you want to keep the trade secret to yourself. But it makes sense to advertise to the world that you can quickly solve some kinds problems faster than others. That makes your trade secret more valuable.

The discipline of writing about what you’re doing

When I was a wee little kid and had just entered college, one of my first classes was “Physics Lab”, and the first class of that was to measure “gravity”, more specifically, the gravitational acceleration . Of course, mathy Computer Science studies that we were, we all knew that the gravitational acceleration would be \(g \approx 9.8 m/s^2\).

Measuring it, however, it’s not very easy. First, remember that this is in the early 90s in Brazil, so digital cameras were very rare. Part of the problem is that 9.8 meters (approximately 30 feet) is quite high, so if we wanted our experiment to take around 1s, we would need a big ladder. Or, as it happened, we’d need to run an experiment that took less than a second for each run. In a class with 25 students, that was the preferred route.

We set up a vertical track attached to a device that would spark every 1/60 of a second, and we attached grid paper to the track. This is called a Behr free fall apparatus.

We would release the device from the top of the track, and the sparks would mark the grid paper. We would then manually measure the distances between the sparks and the differences between the distances would tell us the acceleration. In theory. Again, remember this is the early 90s, so there’s no Windows 95 or Excel easily available, there were several steps that were prone to error.

The desired outcome was not only to calculate the acceleration due to gravity, but also to generate a lab report.

Writing the lab report

It was a simple experiment, but there was a twist. Another class of 25 students would have to replicate the experiment following the lab report from the first 25. Oh boy. We quickly found out that the best lab reports were the ones in which the experimenter would document the experiment as they were executing the experiment. Another thing that worked was going through the experiment more than once. What did not work was to perform the experiment and then go to where the computers were to type up a report from memory.

I thought that was very insightful, shortly afterwards, we would learn about Donald Knuth’s proposed paradigm of literate programming, which lives on in Jupyter Notebooks and R Markdown and that is very popular in data science today. Behind it is the same concept of explanations interspersed with technical work to produce something that is reproducible and easy to understand.

Blogging helps you train to write about your work

Besides the excellent reasons offered by David Robinson for aspiring data scientists to blog, I think experienced data scientists can also benefit from blogging. David argues that aspiring data scientists should blog for technical practice, to build a portfolio, and to get feedback. Experienced data scientists might benefit for similar reasons, but not exactly the same.

Similar to “practicing”, blogging about technical topics helps you learn how to communicate better about them, and also helps you learn how to write better research reports. Similar to “portfolio”, it signals a data scientist’s body of work. For an early-in-career data scientist, it may be more showing that you can do analysis, but for an experienced data scientist is more about what type of analysis you like to do. Finally, I think that “feedback” also works differently for early-in-career and experienced data scientists. I’m not sure random strangers in the internet will sweep in and correct your posts, but you can find out what people are reading and sharing. You may be surprised when you learn what parts of your work people find interesting.