Data Scientists: Fresh Thinkers Vital to Any Organisation
In continuation of our previously published article about the role of a Data Engineer, this text brings to light the challenges both businesses and specialists face when explaining the main strengths and importance of the Data Scientist position.
In order to get a better understanding, we need to outline what responsibilities Data Scientists may have within a Data Science project, what tech stack they use to perform their tasks and how they interact with other positions when working in a team.
Generally speaking, without this increasingly important position, businesses would struggle to make sense of the vast amounts of data they are producing and collecting. Why are customers not coming back, or why are their satisfaction scores lower? Why are deliveries taking longer? How can the recommendation system work better? All these questions can only be answered by someone who knows how to look at the data.
This is why Data Scientists, and all the roles dealing with data manipulation, are the newest “it” jobs in the tech job market around the world: there is keen interest in specialists who are able to manage and make sense of data and turn it into actionable business strategies. No matter the size of the company or the industry, a good Data Scientist is gold.
But what is a “Data Scientist”? This pool of specialists is the hardest category to identify, not only because of its huge overlap with other professionals in this field, but also because the term is a strong buzzword that many stakeholders in business and IT interpret in different ways.
At expertlead, after looking through many profiles and conducting market research, we decided to divide our large Data Scientist pool into two categories: Applied Data Scientists, who work with Machine Learning models, and Academic or Generalist Data Scientists, specialists from an academic background who are able to build complex statistical or mathematical models, e.g. for risk prediction. This separation comes from the fact that not all modern Data Scientists come from computer science or software engineering backgrounds. Those who do transition much more easily into a Machine Learning Engineer role (applied focus). At the same time, many Data Scientists with an academic focus, who have strong expertise in deep statistical and mathematical analysis or in prototyping models, may not have enough experience in deploying these models into production, especially when compared to Machine Learning Engineers or some Data Engineers, who have this experience thanks to their in-depth programming skills. The same logic works vice versa: many Machine Learning experts may not have enough experience in the scientific theory and methodology that Data Scientists with a solid academic background acquire, which can be essential for specific types of data analysis.
When it comes to identifying a common educational background, there is therefore no pattern: it varies from Computer Science to Mathematical Statistics, Computational Biology or Physics. In many cases, the domain knowledge obtained at university does not prevent one from becoming a Data Scientist, whether in a more applied or a more generalist role. Instead, it forms a core strength of the specialist.
Depending on the scope of the skill-set, a Data Scientist can be involved in a great variety of tasks.
- If the whole data life-cycle is taken into consideration, the first step for a Data Scientist is to identify the problem with relevant stakeholders. This usually involves counterparts from the business, tech, and, if it is separate, a data/business intelligence department. This step is particularly important because many businesses are not able to define the problem well enough to even apply a set of basic rules to it or define the process behind it. Therefore, the ability to identify and evaluate a problem in a clear way is key.
- The next step, which can also be performed by a Data Engineer depending on the team structure and project complexity, is data acquisition: gathering data from numerous sources, such as servers, logs, databases, APIs, online repositories, etc. to capture relevant structured and unstructured data.
- The same applies to the next procedure, data preparation, which involves cleaning up inconsistent data types or misspelled attributes and ensuring consistency by converting the gathered data into a common format.
- While the previous stages can be shared between different roles, the next crucial step, exploratory data analysis, is usually performed by a Data Scientist. It includes identifying and selecting the relevant variables that will be used to create an accurate model to tackle the business problem at hand.
- The next important phase is to get things into action: train the model on a dataset and test whether it works and fits the business requirements well. Once the selected model has been tested successfully in a pre-production environment and then deployed, reports and dashboards can be used to get real-time analytics.
- Finally, communicating and visualizing the findings for key stakeholders and decision-makers in the organization, paired with constant monitoring and maintenance of the project's performance, make up the final stages.
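The preparation, exploration and train-and-test steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the delivery records, the field names `distance_km` and `delivery_min`, and the helper functions are all hypothetical, and a simple least-squares line stands in for whatever model a real project would select.

```python
import random
import statistics

# Hypothetical raw records, as they might arrive from logs or an API:
# mixed types, stray whitespace, and one unusable value.
raw_records = [
    {"distance_km": "2.0", "delivery_min": "16"},
    {"distance_km": " 5.5 ", "delivery_min": "33"},
    {"distance_km": "n/a", "delivery_min": "20"},
    {"distance_km": 8, "delivery_min": 45.0},
    {"distance_km": "12", "delivery_min": "66"},
    {"distance_km": "3.5", "delivery_min": "23"},
    {"distance_km": "10", "delivery_min": "55"},
    {"distance_km": "7", "delivery_min": "41"},
]


def prepare(records):
    """Data preparation: convert fields to float, drop rows that cannot be parsed."""
    clean = []
    for row in records:
        try:
            clean.append((float(row["distance_km"]), float(row["delivery_min"])))
        except (TypeError, ValueError):
            continue
    return clean


def fit_line(pairs):
    """Model training: ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(pairs, slope, intercept):
    """Model testing: average absolute gap between prediction and reality."""
    return statistics.fmean([abs(y - (slope * x + intercept)) for x, y in pairs])


data = prepare(raw_records)

# Exploratory analysis: a first look at the target variable.
delivery_times = [y for _, y in data]
print("rows:", len(data), "mean delivery (min):", statistics.fmean(delivery_times))

# Hold out part of the data so the model is tested on rows it has not seen.
random.Random(0).shuffle(data)
train, test = data[:5], data[5:]

slope, intercept = fit_line(train)
mae = mean_absolute_error(test, slope, intercept)
print("slope:", slope, "test MAE (min):", mae)
```

In practice each of these steps would typically rely on dedicated libraries and far larger datasets. The point of the sketch is only the order of operations: clean first, explore, then train on one part of the data and test on another.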
Whether designing and training Machine Learning models or running advanced statistical analyses, Data Scientists use different skills and the corresponding technologies:
- Programming languages may vary from classics in the industry like Python and R to Scala, C, C++, Java, MATLAB, etc.
- Python libraries such as Matplotlib, Bokeh, Plotly or Seaborn, or R packages like ggplot2 and Shiny can be used for data analysis and visualization.
- In addition to R and MATLAB there are other tools available for statistical analysis, such as SPSS, SAS, MS Excel, etc.
- When it comes to databases, the choice highly depends on the company’s datasets and types of applications, but generally we can divide them into relational database systems like SQL Server, Oracle or MySQL and NoSQL systems like Cassandra and MongoDB.
- Some Data Scientists also have experience with cloud platforms, e.g. GCP and AWS, or with so-called Big Data tools such as Apache Hadoop for the distributed processing of large data sets, the Apache Spark analytics engine or the Apache Kafka stream-processing platform.
- With regard to Machine Learning and Deep Learning, many libraries, depending on the programming language, are extremely useful when performing different tasks. For example, popular Python libraries include PyTorch (dominates research), TensorFlow (dominates industry and is used for Machine Learning applications such as neural networks) as well as Keras, Scikit-learn, Theano, etc. There is also the caret package for supervised Machine Learning in R, Java libraries such as Deeplearning4j or the MALLET package for statistical Natural Language Processing (NLP), and C++ libraries such as mlpack, the Dlib toolkit or Shark.
- In addition, knowledge of Docker and Kubernetes can be essential for implementation of Machine Learning algorithms in production.
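To make the relational versus NoSQL distinction in the list above a little more concrete, here is a small sketch using only the Python standard library. It rests on loud assumptions: an in-memory SQLite database stands in for a relational server such as MySQL, plain JSON documents stand in for records in a document store such as MongoDB, and the `orders` table and its fields are invented for illustration.

```python
import json
import sqlite3

# Relational style: a fixed schema, queried with SQL.
# sqlite3 (in memory) is only a stand-in for a relational server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)
# Each result row is a (customer, sum) pair, so dict() builds a lookup table.
totals = dict(
    conn.execute("SELECT customer, SUM(total) FROM orders GROUP BY customer")
)
conn.close()

# Document style: each record is a self-contained, possibly nested document,
# and fields may differ from one record to the next (note the extra "coupon").
documents = [
    json.loads('{"customer": "alice", "items": [{"sku": "A1", "price": 30.0}]}'),
    json.loads(
        '{"customer": "bob", "items": [{"sku": "B7", "price": 12.5}], "coupon": "WELCOME"}'
    ),
]
doc_totals = {
    doc["customer"]: sum(item["price"] for item in doc["items"]) for doc in documents
}

print(totals)
print(doc_totals)
```

The relational version pushes aggregation into the query, while the document version leaves the structure flexible and aggregates in application code. Which trade-off fits depends, as noted above, on the company's datasets and applications.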
This list doesn’t include all the relevant technologies that are used by Data Scientists while working on a project. Neither does it enumerate technologies every Data Scientist must know. From our experience, many people are more interested in, or show greater strengths in, the research and development part of a project, while others are drawn to the infrastructure and production work. There are myriad skill combinations.
For this reason, a Data Science team working on a project cannot consist solely of Data Scientists. There are many different positions that represent various branches of the data science field, and they usually work together. Apart from Machine Learning Engineers, Data Engineers are also important to mention and to differentiate from Data Scientists. They are responsible for identifying, cleaning, integrating and organizing data from different sources in such a way that it can be used by Data Scientists or Data Analysts. In essence, they prepare the groundwork that makes Data Scientists’ jobs easier.
This groundwork is essential for Data Scientists to work on advanced predictive analytics, either by assessing potential future scenarios with advanced statistical methods (e.g. clustering or time series analysis) or by drawing on the field of AI, including Machine Learning and Deep Learning, to predict behaviour in unprecedented ways through supervised, unsupervised or reinforcement learning techniques.
It is important to mention that many projects Data Scientists work on do not have a straightforward solution at first, unlike in many other IT areas. Quite often, valuable insights about a single problem can be obtained only after a few months. This means that a business should be ready to invest not just financial resources, but also enough time before it gets the right solution. That is why it is extremely important for a Data Scientist to present the results to the stakeholders in a clear and concise manner and at the same time guide management on what to do with this information. A good specialist is expected not only to work with complex algorithms or manage large datasets, but also to explain the chosen solution and convince the business of it, be it a “simple” strategy or a complex Machine Learning model, as well as to ensure its maximum possible accuracy. These soft skills help develop trust and lead to further investment, which is essential for running successful projects in the Data Science area.
Overall, Data Science is a huge field. Some will claim you need to master Python and SQL, while others will argue you cannot perform without Scala or Java, a Computer Science degree and complete fluency in Spark or Hadoop. Others swear by R and straight-up statistical learning. Some say MATLAB and linear algebra are bulletproof solutions. However, none of them is right or wrong. The reality is that there is no single way to do data science, since every company has its own stack and every business has a data challenge requiring specific methods and knowledge.
In the job landscape, freelance Data Scientists and Machine Learning Engineers in particular have seen increased interest among our partners, on both the employer and the freelancer side. Companies are looking for a fresh perspective without the financial burden of hiring a traditional consulting company. Data Scientists, for their part, are interested in freelance opportunities to explore different facets of data science and to challenge themselves in different industries and with a greater variety of problems.
At expertlead we have a Data Science community of vetted freelancers. If you are interested in finding the perfect Data Scientist for your project, be it a specialist with a strong statistical or mathematical background or someone who is able to provide an end-to-end Machine Learning solution, feel free to reach out to us.