My Data Science Internship Experience at Statistics Canada
August 26, 2019 By Pascal Schmidt personal
In this blog post, I’ll be talking about my data science internship experience at Statistics Canada, in Ottawa, and what I have learned throughout my 4 months.
First off, I am very happy that I went to Ottawa during the summer. A lot of people warned me about the winter months and advised me to return to Vancouver in September again.
What we will be covering in this post is:
- The data science interview process
- The nature of my project
- My biggest challenges
- What I have learned
- What I had wished for
- Statistics Canada soccer tournament
- A glimpse into the future
The Interview Process at Statistics Canada for my Data Science Internship
I applied for a data science position at Statistics Canada. First, I had to fill out a form where I checked off which hard skills I have (R, SAS, Python, web scraping, machine learning, etc.). Moreover, they gave me two behavioral questions, and I had to write a minimum of 100 words about each.
The second part of the data science interview process was a conference call where I got interviewed by two divisions. One was the health division and the other one was the demography division. Most of the questions I got were related to the projects I put on my resume.
I was asked what kind of feature engineering I did for the Titanic data set and how I improved my score. They also asked me about the women-child model, how I coded survivors versus non-survivors, and what kind of R packages I used.
Another project they asked me about was the one where I analyzed the Vancouver housing market from web-scraped data. I got asked about the web scraping process and how I built the linear regression model. On top of that, the interviewers wanted to know what kind of variables could have improved my model.
The questions above basically made up the first half of the interview. I felt like I was rambling a lot because I had not reviewed my projects in detail and had already forgotten some parts. I definitely should have reviewed the projects on my resume better.
The second part of the interview was about my experience with the BC Cancer Agency. I was asked about the work I did, what packages and technologies I used, and about the size of the data sets I worked with. At the BC Cancer Agency, I only worked with pretty small data sets that ranged from around 350 observations to 50,000 observations. At Statistics Canada, the data sets were a bit bigger and ranged from 350,000 observations to 2,500,000 observations.
After around 3 days, I got the offer from the demography division and accepted. After a few more weeks of checking my background and making sure that I am not a serial killer, I got the official offer. Yayyy!!!
The Nature of My Data Science Project
I worked with administrative data and census data. I analyzed which addresses matched between the census and various administrative data sets. I compared the characteristics of matching addresses versus non-matching addresses and gave insight into how well administrative data compares to the actual 2016 census.
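To give a rough idea of what this kind of comparison looks like in R, here is a minimal sketch in base R. The data frames, column names, and addresses below are made up for illustration and are not the real data or the code from my project.

```r
# Hypothetical toy data, invented for this example
census <- data.frame(
  address = c("1 Main St", "2 Oak Ave", "3 Elm Rd", "4 Pine Cres"),
  dwelling_type = c("house", "apartment", "house", "apartment"),
  stringsAsFactors = FALSE
)
admin <- data.frame(
  address = c("1 Main St", "3 Elm Rd", "9 Birch Way"),
  stringsAsFactors = FALSE
)

# Flag census addresses that also appear in the administrative file
census$matched <- census$address %in% admin$address

# Share of census addresses found in the administrative data
match_rate <- mean(census$matched)

# Compare one characteristic across matching and non-matching addresses
match_by_type <- table(census$dwelling_type, census$matched)

print(match_rate)
print(match_by_type)
```

Real addresses would need to be standardized (spelling, abbreviations, unit numbers) before an exact comparison like `%in%` is meaningful.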
My Biggest Challenges
As mentioned earlier, I worked with data that ranged from around 350,000 observations to around 2,500,000 observations. My computer had around 8 gigabytes of RAM, so I was still good to work in R.
My biggest fear was that the data was going to be too big and exceed my RAM. In this case, I would have had to work in SAS (ughh). Fortunately, I got lucky enough to work in R and get the flexibility and creativity a programming language can offer.
Only 5% of my work consisted of using SAS. This was when I had to merge data sets of the entire Canadian population (40 million observations). The other 95% was done in R.
Sometimes, I had to improve the for loops in my functions because they just took forever to run. Because of that, I would often use some sort of vectorization to speed up the process. On top of that, I also looked into the Rcpp package and learned how to use it to call C++ code from R. Using C++ for loops made my code run much faster and left more time for other things.
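Here is a minimal sketch of what swapping a for loop for vectorized code can look like in base R. The task (flagging values that differ from the previous one) and the function names are made up for illustration and are not taken from my project. I left Rcpp out so that the example runs without a C++ compiler.

```r
# Hypothetical toy data, invented for this example
set.seed(1)
x <- sample(1:5, 1e5, replace = TRUE)

# Loop version: flag each element that differs from the one before it
changed_loop <- function(x) {
  out <- logical(length(x))
  for (i in seq_along(x)[-1]) {
    out[i] <- x[i] != x[i - 1]
  }
  out
}

# Vectorized version: compare the vector with a shifted copy of itself
changed_vec <- function(x) {
  if (length(x) == 0) return(logical(0))
  c(FALSE, x[-1] != x[-length(x)])
}

# Both versions should give the same result
same_result <- identical(changed_loop(x), changed_vec(x))
print(same_result)

# Rough timing comparison; the numbers depend on the machine
print(system.time(changed_loop(x)))
print(system.time(changed_vec(x)))
```

The vectorized version does the comparison in one call instead of once per element, which is usually where the speed-up comes from.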
Obviously, domain knowledge was also another challenge. Learning about different variables and acronyms takes a long time. Now, I am just getting used to the demographic jargon used in my division but have to leave already in two weeks. Oh well, it is what it is.
What I Have Learned
As mentioned earlier, I have learned how to write C++ code in R with the Rcpp package. Here is also a blog post about a problem I had at work and how I used a C++ for loop to speed up my work processes. I also got more familiar with the tidyverse in general, data cleaning, and data processing.
In my team, there was one more guy who used R for his projects. We often exchanged toy data sets and asked each other for help with data cleaning tasks. In particular, we challenged ourselves to avoid for loops in R, speed up data wrangling tasks, and make the code more readable. It was a lot of fun to come up with solutions to problems.
What I Had Wished For
As already mentioned in the above paragraphs, only one other member of my team was using R. All the others were using SAS and Excel. I wish I had worked with a lot of experienced R programmers who could have taught me about programming in general and, of course, R itself.
The job I had was more of a data analyst position than a data scientist position. During the last month of my project, we built a logistic regression model and analyzed the odds ratios for various models. All in all, however, my project was more about producing numbers and graphs. In conclusion, I wish I had done a bit more science and been exposed to some more technical procedures.
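For readers who have not worked with odds ratios before, here is a minimal sketch of how they can be read off a logistic regression in base R. The data are simulated and the variable names are made up for illustration; this is not the model we built at work.

```r
# Simulated toy data, invented for this example
set.seed(42)
n <- 500
is_apartment <- rbinom(n, 1, 0.5)

# Assumed true relationship on the log-odds scale
p <- plogis(-1 + 0.8 * is_apartment)
matched <- rbinom(n, 1, p)

# Fit a logistic regression
fit <- glm(matched ~ is_apartment, family = binomial)

# Exponentiated coefficients are odds ratios (the intercept gives baseline odds)
odds_ratios <- exp(coef(fit))

# Wald 95% confidence intervals on the odds-ratio scale
odds_ratio_ci <- exp(confint.default(fit))

print(odds_ratios)
print(odds_ratio_ci)
```

An odds ratio above 1 for `is_apartment` means that the odds of `matched` being 1 are higher when `is_apartment` is 1 than when it is 0.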
Statistics Canada Soccer Tournament
The Statistics Canada soccer tournament was definitely the highlight of my 4 months here in Ottawa. The tournament lasted for 4 weeks, and different divisions played each other. One of the jokes in the office was that I only got hired because of this tournament. I played in the 3rd division in Germany and was also a varsity soccer player for Simon Fraser University in Vancouver. I hope the hiring process was not solely based on my soccer abilities, but at the end of the day, I am good either way.
Pro tip: For all future interns who want to apply for Statistics Canada during the summer, make sure to emphasize your soccer skills on your resume and your chances of getting hired will increase infinitely. Long story short, you are hired!!!!
A Glimpse into the Future
On a more serious note, I really enjoyed my internship and worked with great people. The government is a very calm place to work. For the future, I definitely want to try out an industry job. However, most industry jobs list Python as a requirement, and I have not used it in any of my previous jobs. I have used it a bit before, and I think it is easy to pick up when one understands programming. Still, this will be one hurdle to overcome when applying to jobs in the future.
I am going into my last semester and will be graduating in December. After that, I will be applying for master's programs in statistics for fall 2020 as well as for jobs at the beginning of January 2020. I will keep you updated about how that goes. It will definitely be a challenge competing with computer science graduates and statistics graduates for data science jobs. Scary and exciting at the same time.
If you are interested in the code I wrote for the project, then you can check that out on my GitHub.
Anyways, I hope you have enjoyed this post and if you have any questions about the internship or in general, let me know in the comments below.
Comments (5)
Great blog! 🙂 Been reading through this blog for things as a statistics major and it’s been very helpful
Hi,
Glad to be of help. If you have any questions let me know and I am happy to answer! Take care.
Hello, Thank you very much for sharing this information, I would like to know where do you find this posts of offers of an internship in Statistics Canada
I found mine through the university I attended. Some companies post their internship positions on LinkedIn or Indeed. Maybe try emailing them and ask what the best way is to get an internship position with them.
Hi Pascal,
Thanks for this informative blog.
I have a question: what is the interview process of Data Science internships like?
What skills to gain, any online courses and/or books to go through to learn DS (as a beginner), and any recommendations for interview prep specific books, like there is one for Software Engineering: Cracking the Coding Interview?