Broadening horizons: from DBA to Data Science #1
As IT professionals we are used to spending a lot of time learning about new technologies or improving our knowledge of the products we work with. I have been working with SQL Server for more than 10 years now and still learn new things about the product daily. However, another technology is slowly gaining a spot in my IT heart: Data Science.
Around a year ago I took some careful steps to get a better understanding of the “BI” side of SQL Server. Keep in mind I am a pure “Engine” guy. This means I am perfectly comfortable with tuning queries or SQL Server instance performance, but terms like “cubes”, “ETL” and “BIML” scare me. Even though I have some basic knowledge of some of the products in the SQL Server stack, like SSIS and SSAS, it isn’t my primary business to deal with those tools, so my knowledge isn’t at the level I would like it to be. So I did what I am used to doing: I started digging and tried to learn as much about the BI area as possible.
Pretty quickly I ran, head-first, into a concrete wall. The BI area that is related to SQL Server is enormous! There are so many technologies, techniques and areas of expertise that I simply did not know where to start. So at the end of March I decided to post a simple question on Twitter, “where to start when you want to learn more about BI/Analytics”.
The number of responses was overwhelming, to say the least, and I got some great directions on where to start. More importantly, I learned that the term “BI” is pretty subjective. For me this meant that I had to find my own definition of BI. I started with some of the tips I received and pretty quickly got in touch with R. From there I learned that R is a (small) part of a much bigger picture, “Data Science”.
Road to Data Scientist by Swami Chandrasekaran
I found that many of the technologies and skills used in the Data Science area matched perfectly with my definition of BI, or should I say analytics, and so my journey into Data Science started…
A question I get a lot when I tell people I am working towards a better understanding of the Data Science areas is whether I am switching away from being a DBA. The answer is pretty short: there is no switch. I love being a DBA, and the feeling I get when I optimize a query to go from 50 to 2 seconds is one of pure victory. Think of it more as broadening my horizon. I can already think of various scenarios where Data Science techniques can improve my DBA work, for instance analyzing and visualizing query performance using R. It also works the other way around: having a solid understanding of the relational model and the SQL query language will definitely help in my Data Science study.
So why are you writing all this in a blog post?
When talking to various people about my plans (many also SQL Server DBAs), I noticed many of them are either already working with techniques like Machine Learning or are also planning to learn more about Data Science. Since there is so much information out there about the various parts of Data Science I decided to write down my experiences hoping this might help those that are thinking of following the same road I am taking right now.
It also has another advantage. Writing things down helps immensely to keep the knowledge in your head. This makes it a win-win situation: other people can learn from my experiences while I get an extra knowledge “check” when writing a blog post.
I learned that it is important to have a plan when you decide to study an area that you are completely unfamiliar with. A plan should, in my opinion, always start with a goal: the place you want to reach when you start learning. In this case my goal is:
Using Data Science techniques to provide value and understanding from data
Since Data Science is a completely new area for me, and one that consists of many tools and techniques, I made a learning path for myself. For me, following a path is very important. I can get carried away pretty quickly by specific subjects, and a learning path keeps me on an efficient road to my goal.
For my learning path I use the Road to Data Scientist map by Swami Chandrasekaran which you can see above. It allows you to easily view the different Data Science areas and it suggests a learning order that makes sense to me.
Since I prefer to understand the fundamentals before diving into more advanced topics, the first step I am working on is getting my mathematics back to the level I think is required. This means I have to work my way back to linear algebra again. My last math lessons were quite a few years ago, which means I have a lot of ground to cover.
The first resource I use to refresh my math is a (Dutch) math book that focusses on bringing math knowledge from a middle school level up to a high school / university level. There is an English version of the book available, called “All you need in maths!”, but I am sure there are many other books that cover the same math areas.
Next to the book, another resource has already proven itself invaluable for learning math: Khan Academy. Khan Academy has an enormous number of math-related videos that are completely free and generally easy to follow.
Many people informed me that math is not strictly necessary when you want to work with techniques like Machine Learning. Personally I want to know how things “work” before I use them, so I decided to start with math first.
The best way to look at this blog post is as an introduction to how I got in touch with Data Science and how I plan to learn more about it. In the next article I will go into more detail about where I am now in my study, what I have learned and where I am experiencing difficulties.