By Neale Swinnerton - Senior Data Engineer - infoNation
Getting started in data science can be daunting. There are many different areas to explore, and the technology choices are just as mind-boggling. This post looks at some of the building blocks that will serve you well whichever area of data science you gravitate towards.
Firstly, let's tackle the difference between data science and data engineering. Successful projects rely just as much on one discipline as the other, and the line between the two can get very blurred.
- Data Science is the set of algorithms that you apply to data to gain some insight.
- Data Engineering is the set of compute, storage and network tools you bring to bear on the data to support the Data Science work.
Building Block 1 - The Python Programming Language
Topics: Data Science 50% / Data Engineering 50%
Python is currently the lingua franca of data science. The good news is that Python was designed to be easy for beginners while offering powerful features that allow your systems to grow. It is a general purpose programming language, so you'll be just as happy building the engineering aspects as the analysis. The best place to start with Python is the Python website (https://www.python.org/). Once you have got to grips with the language itself, you will find a large number of libraries to support your work:
- NumPy and SciPy for the numerical and scientific aspects
- pandas for data extraction and cleanup
- scikit-learn for classic Machine Learning (ML)
- TensorFlow and Keras for Neural Networks and Deep Learning
- PyTorch for GPU-accelerated ML
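To give a feel for how approachable the language is, here is a minimal sketch of a small extract-and-clean-up task using only Python's standard library. The sample data, the column names and the `mean_by_group` helper are all made up for illustration; a library such as pandas would typically handle the same job in fewer lines.

```python
import csv
import io
import statistics

# Hypothetical sample data; in practice this would come from a file or database.
RAW_CSV = """city,temperature_c
Southampton,14.5
Southampton,16.0
London,15.2
London,
London,17.8
"""


def mean_by_group(csv_text, group_col, value_col):
    """Return {group: mean of value_col}, skipping rows with a missing value."""
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = row[value_col].strip()
        if not value:
            # Cleanup step: ignore rows that have no measurement.
            continue
        groups.setdefault(row[group_col], []).append(float(value))
    return {name: statistics.mean(values) for name, values in groups.items()}


if __name__ == "__main__":
    means = mean_by_group(RAW_CSV, "city", "temperature_c")
    for city, mean_temp in means.items():
        print(f"{city}: {mean_temp:.2f}")
```

The same pattern of reading, cleaning, grouping and summarising is what the libraries above make faster and more convenient on larger data sets.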
Building Block 2 - The R Programming Language
Topics: Data Science 80% / Data Engineering 20%
Whilst Python is used for much data science, another programming language, R, is widely used in the statistical community. Since much Data Science work has a basis in statistics, there is considerable crossover; for example, it is useful to be able to understand algorithms developed in R. R has been developed with a different focus from Python and is not really regarded as a general purpose programming language in the same way. You are unlikely to implement the engineering aspects of your system in R, although the line is often blurred. R has thousands of libraries available at https://cran.r-project.org/
Building Block 3 - Jupyter Notebooks
Topics: Data Science 70% / Data Engineering 30%
When building analysis pipelines, an important consideration is documenting the assumptions you have made and generating visualizations of any insights you have found. This has led to wide adoption of a literate programming style known as 'Notebooks', in which prose, code and results sit side by side in one document. The leading implementation is Jupyter, which lets you run notebooks locally or host them within your organization to share with colleagues.
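To make the idea concrete, the sketch below builds a tiny notebook document by hand. It assumes the version 4 notebook layout as I understand it (a JSON document with a list of markdown and code cells); the `make_notebook` helper is hypothetical, and the official nbformat documentation should be treated as the authority on the exact schema.

```python
import json


def make_notebook(cells):
    """Build a minimal notebook dict from (cell_type, source) pairs."""
    nb_cells = []
    for cell_type, source in cells:
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            # Code cells also carry an execution counter and captured outputs.
            cell["execution_count"] = None
            cell["outputs"] = []
        nb_cells.append(cell)
    return {
        "cells": nb_cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


# A markdown cell documents the assumption; a code cell does the work.
notebook = make_notebook([
    ("markdown", "## Assumption\nRows with missing temperatures are dropped."),
    ("code", "temps = [14.5, 16.0]\nsum(temps) / len(temps)"),
])

# Serialised form; saving this text to a file ending in .ipynb is how
# notebooks are normally stored on disk.
notebook_json = json.dumps(notebook, indent=1)
```

In day-to-day use you would create cells through the Jupyter interface rather than by hand; the point here is only that a notebook is an ordinary document interleaving explanation and executable code.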
Building Block 4 - Cloud Computing
Topics: Data Science 20% / Data Engineering 80%
Large cloud computing services have long served the needs of data engineering by allowing developers to assemble compute and storage resources into scalable data processing systems. In recent years they have also started offering turnkey, general purpose machine learning and artificial intelligence services, which can often get a project past the hard engineering problems and into more business-focused analytics.
The leading providers are:
Building Block 5 - Community
There is a thriving data science community and participation is encouraged. Good places to get started are:
- Kaggle - the premier Data Science hub, where you will find competitions, sample data sets to experiment with, and online learning
- Data Science Stack Exchange - one of the StackExchange portfolio of sites providing a Q&A forum focused on Data Science
A career in Data Science and/or Data Engineering will be exciting and challenging in equal measure. Both fields are still relatively new and evolving at break-neck speed. The cloud service providers want to offer turnkey solutions, but data science by its nature is intimately involved with the data set: what works for one data set may be inappropriate for another.
The value that a good data scientist can add is immense. Although we call it science, the intuitions you will develop are often more like art, and that's why the field can be both delightful and frustrating. Getting involved with a community is the best way to see where to take your next steps along this journey.