Doing Data Science Without Programming Knowledge? My Data Science Journey

January 28, 2019 By Pascal Schmidt personal

When you type “How to become a Data Scientist” into any search engine, the first thing that jumps out at you is a bulleted list of requirements. At the top of that list, you will almost always find programming or coding.

Almost everyone says that knowing how to program is absolutely vital for doing data science. I argue, however, that data science can be done without any programming knowledge. Rest assured, I am not talking about Stata or other point-and-click interfaces. I am talking about data science in RStudio.

How can so many people say the opposite then? Well, to become a good Data Scientist, you certainly have to know how to program. However, in order to do basic data science, all you need is some knowledge of the tidyverse.

What is the Tidyverse?

The tidyverse is a collection of R packages that cover the everyday steps of data science, such as importing, cleaning, transforming, and visualizing data. No knowledge of programming or computer science fundamentals is needed to understand and use tidyverse packages.

The tidyverse packages are opinionated. This means that they share a common design philosophy: functions generally take the data as their first argument and follow the same naming and argument conventions across packages. This makes the tidyverse consistent and, therefore, easy to learn. If you are interested in knowing more about the tidyverse, I posted a book review about “R for Data Science”, which talks about the tidyverse.
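To give a feel for that consistency, here is a minimal sketch using dplyr and the mtcars data set that ships with R. The data set, the filter threshold, and the column names mean_mpg and n_cars are my own choices for illustration, not something from a particular project.

```r
library(dplyr)

# Every verb below takes a data frame as its first argument and returns
# a data frame, so the steps chain together with the pipe (%>%).
summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                 # keep cars with more than 100 horsepower
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(
    mean_mpg = mean(mpg),              # average fuel economy per group
    n_cars   = n()                     # number of cars per group
  ) %>%
  arrange(desc(mean_mpg))              # most economical group first

print(summary_by_cyl)
```

Each line reads almost like a sentence, and none of it requires writing a loop or a function yourself.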

The Purpose of R

Yes, you are right that R is a programming language and can be used as such. However, R, unlike Python or other general-purpose languages, was created specifically for data analysis. Consequently, it is much easier to get started with data analysis in R than in most other languages without knowing computer science fundamentals.

How to Do Data Science Without Any Programming Knowledge?

When learning how to program, almost everyone starts out with learning about data structures, if-else statements, for loops and while loops, and finally about creating functions for a more customizable analysis.

All the parts mentioned above are important in the long run. The emphasis is on the long run: they are not important when you are getting started in data science. No one gets excited about for loops and if-else statements and, frankly, they are not necessary for a first data science project at all. In addition to that, for loops are harder to understand and can even work against you in R, where they are often slower and more verbose than the alternatives. There is a neat package called purrr in the tidyverse that makes writing most loops unnecessary. On top of that, you can vectorize a lot of problems instead of using loops.
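As a rough sketch of what this looks like in practice, here are three ways to take the square root of every number in a vector. The vector x and the variable names are made up for illustration.

```r
library(purrr)

x <- c(1, 4, 9, 16)

# Loop version: preallocate a result vector and fill it element by element
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Vectorized version: sqrt() already works on the whole vector at once
roots_vec <- sqrt(x)

# purrr version: apply a function to each element and return a numeric vector
roots_map <- map_dbl(x, sqrt)

print(roots_vec)
```

All three give the same result, but the vectorized and purrr versions say what you want in a single line.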

The R Community

There are plenty of packages on CRAN and GitHub that won’t require any programming knowledge. The only requirement for using these packages is to read their documentation. Therefore, everyone can run a linear regression or other machine learning algorithms by using packages others have built.

Learning the interpretation of such models and when to use certain models is (way) more important than knowing how to program.
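For example, fitting a linear regression takes a single call to lm(), which comes with base R. The sketch below models fuel economy (mpg) from car weight (wt) in the built-in mtcars data set; the choice of data set and variables is mine, purely for illustration.

```r
# Fit a simple linear regression: miles per gallon explained by weight
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficient table, standard errors, and R-squared
print(summary(fit))

# Just the estimated intercept and slope
print(coef(fit))
```

Running the code is the easy part. Knowing that the slope for wt is the estimated change in mpg for each additional 1,000 lbs of weight, and whether a straight line is a sensible model for this data in the first place, is where the real work lies.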

My Own Experience of My Data Science Journey

First Baby Steps

My first exposure to programming was an introductory university course on linear regression in R. We used the lm() function in R and some base plotting functions for visualizations. After that class, I wanted to learn about programming and tried out a Python course on Udemy. The course started by teaching data structures, subsetting, if-else statements, loops, and functions.

My Frustrations With Coding

When I was about halfway through the course, I was supposed to build a tic-tac-toe game from scratch. I found it so incredibly difficult that I quit and never finished the course. I also remember constantly asking myself how I would ever use if-else statements and for loops in a real program. How is something like this useful?

# Print the numbers 1 through 10
for (i in 1:10) {
  print(i)
}

Some More Frustration with Coding and Data Science

After that, I continued my university courses (I am a Statistics major). I continued my data science journey by copying and pasting R code from tutorials for my assignments. All I had to do was switch out a data frame and some variables. I still was not learning how to do data science.

We Are Getting There but Not Quite Yet

My next attempt at doing data science was the Titanic competition on Kaggle. I was not really sure what I was doing, but I desperately wanted to build some kind of project. I followed the tutorial by Trevor Stephens. He uses a lot of subsetting for his data manipulations. I followed the tutorial and could reproduce his numbers, but I only understood around 20% of his code.

Finally Light at the End of the Tunnel

By then, I had been exposed to R for about 8 months through Kaggle and some university coursework, and I still was not able to do data science. It was frustrating, and I was looking for something that would let me do real data science projects. I started some other R courses on Udemy and finally had my real breakthrough with DataCamp.

From there, I picked up ggplot2 and dplyr super quickly. All of a sudden, I was able to do some data analysis on easy data sets like the gapminder data. Encouraged by these results, I dug deeper and deeper into the language (though still only scratching the surface) and used resources like Stack Overflow a lot more.

The Ultimate Data Science Boost

Through university, I then took courses that specifically taught R, and I got a lot more confident. The following semester, I did an internship at the BC Cancer Agency as a Data Science Trainee, where I coded in R for 8 hours every day.

During that time, my programming knowledge took off and, for the first time, I had to do real programming. I learned how to use for loops properly, how to write functions, and how to write clean and maintainable code. At the end of my internship, I was responsible for creating a report for which I had to reshape and wrangle data from 24 data frames. This little project took my tidyverse skills to the next level and made me appreciate the beauty (functionality) of the packages even more.

There is still a lot more to learn, such as more data structures and algorithms. Unfortunately, I do not come from a Computer Science background, so I have to catch up on these fundamentals. I also want to start learning Python in the near future, as I see a lot of Data Science internship postings that require Python and cloud computing.

Concluding Thoughts

Lately, I have been reflecting a lot on my data science journey. I have read a lot about the R community and what its members think about how data science should be taught. People like Hadley Wickham and David Robinson also favor the route I took (well, not initially but eventually). That is, start doing data science with the tidyverse first and then learn how to program. This way, you’ll see results early on without getting discouraged.

The essence of this blog post is that you don’t need to know how to program to do data science. However, you eventually have to pick it up to become a good Data Scientist.

I hope you found this post useful and inspiring. Let me know about your data science journey in the comments below. How did you start out?

 

 

 

 
