Want to start your career as a Data Scientist, but don’t know where to start? You are at the right place! Hey guys, welcome to this Data Science Tutorial blog. It will give you a kick start into the data science world. Let’s look at what we will be learning today:

- Why Data Science?
- What is Data Science?
- Who is a Data Scientist?
- Job Trends
- How to solve a problem in Data Science?
- Data Science Components
- Data Scientist Job Roles

**Why Data Science?**

It’s been said that Data Scientist is the “Sexiest Job of the 21st Century”. Why? Because over the past few years, companies have been storing more and more of their data. With nearly every company doing this, the result has been a data explosion. Data has become one of the most abundant resources today.

But, what will you do with this data? Let’s understand this using an example:

Say you have a company which makes mobile phones. You released your first product, and it became a massive hit. Every technology has a life, right? So now it’s time to come up with something new. But you don’t know what to innovate in order to meet the expectations of the users who are eagerly waiting for your next release.

Somebody in your company comes up with the idea of using user-generated feedback to pick out the things users are expecting in the next release.

In comes Data Science. You apply data mining techniques such as sentiment analysis to that feedback and find out what users actually want.

And it’s not only this. You can make better decisions, reduce your production costs by finding more efficient ways of working, and give your customers what they actually want!

There are countless benefits like these that Data Science can bring, and hence it has become important for your company to have a Data Science team. Requirements like these led to “Data Science” becoming a subject in its own right, and hence we are writing this Data Science Tutorial blog for you. 🙂

**Data Science Tutorial: What is Data Science?**

The term Data Science has emerged recently with the evolution of mathematical statistics and data analysis. The journey has been amazing, and a lot has already been accomplished in the field.

Researchers from MIT have even claimed that, in the next few years, machines will be able to predict the future. They have already reached a milestone: their system can predict what will happen in the next scene of a movie! How? Well, it might be a little complex for you to understand as of now, but don’t worry. By the end of this blog, you shall have an answer to that as well.

Coming back to Data Science: it is also known as data-driven science, and it makes use of scientific methods, processes and systems to extract knowledge or insights from data in various forms, i.e., either structured or unstructured.

What these methods and processes are is what we are going to discuss in this Data Science Tutorial today.

Moving forward, who does all this brainstorming, or who practices Data Science? A **Data Scientist**.

**Who is a Data Scientist?**

As you can see in the image, a Data Scientist is the master of all trades! They should be proficient in maths, ace the business side, and have great Computer Science skills as well. Scared? Don’t be. Ideally you would be good in all these fields, but if you aren’t, you’re not alone! There is no such thing as “a complete data scientist”. In a corporate environment, the work is distributed among teams, and each team has its own expertise. The point is that you should be proficient in at least one of these fields. And even if these skills are new to you, chill! It may take time, but these skills can be developed, and believe me, it will be worth the time you invest. Why? Well, let’s look at the job trends.

**Data Scientist Job Trends**

Well, the graph says it all. Not only are there a lot of job openings for data scientists, but the jobs are well paid too! And no, our blog will not cover the salary figures, go google!

So we now know that learning data science actually makes sense, not only because it is very useful, but also because it offers a great career in the near future.

Let’s start our journey in learning data science now and begin with,

**How to solve a problem in Data Science?**

So now, let’s discuss how one should approach a problem and solve it with data science. Problems in Data Science are solved using algorithms. But the biggest thing to judge is which algorithm to use, and when to use it.

Basically, there are 5 kinds of problems which you can face in data science, and each one can be framed as a question.

Let’s address each of these questions and the associated algorithms one by one:

**Is this A or B?**

With this question, we are referring to problems which have a categorical answer, i.e., the answer must come from a fixed set of options: yes or no, 1 or 0, or interested, maybe or not interested.

**For Example:**

Q. What will you have, Tea or Coffee?

Here, you cannot say you would want a coke! The question only offers tea or coffee, so you can answer with one of these only.

When there are only two possible answers, e.g. yes or no, 1 or 0, it is called two-class (binary) classification. With more than two options, it is called multi-class classification.

To conclude: whenever you come across a question whose answer is categorical, you will be solving it in Data Science using classification algorithms.
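To make the “A or B” idea concrete, here is a minimal sketch of one simple classification algorithm, k-nearest neighbours, written in plain Python. The data, the feature choice (hours of sleep, cups already drunk) and the function name `knn_classify` are all made up for illustration. This is just one of many classification algorithms, not the only way to do it.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Sort training examples by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels of those neighbours and pick the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (hours of sleep, cups already drunk) -> drink chosen.
train = [
    ((8.0, 0), "tea"),
    ((7.5, 1), "tea"),
    ((7.0, 0), "tea"),
    ((5.0, 2), "coffee"),
    ((4.5, 3), "coffee"),
    ((5.5, 2), "coffee"),
]

print(knn_classify(train, (7.8, 0)))  # well-rested person
print(knn_classify(train, (4.8, 2)))  # sleepy person
```

Notice that the answer can only ever be one of the labels seen in training, which is exactly the “tea or coffee, never coke” situation.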

The next problem in this Data Science Tutorial that you may come across may be something like this:

**Is this weird?**

Questions like these deal with patterns and can be solved using Anomaly Detection algorithms.

**For Example:**

Try associating the problem “is this weird?” to this diagram,

What is weird in the above pattern? The red guy, isn’t it?

Whenever there is a break in the pattern, the algorithm flags that particular event for us to review. A real-world application of this has been implemented by credit card companies, wherein any unusual transaction by a user is flagged for review. This improves security and reduces the human effort spent on surveillance.
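Here is a rough sketch of how such flagging could look, assuming a very simple rule: a transaction is “weird” if it lies more than a chosen number of standard deviations from the average. The amounts, the threshold of 2.0 and the function name `flag_anomalies` are assumptions for illustration. Real fraud detection systems are far more sophisticated.

```python
import statistics

def flag_anomalies(amounts, threshold=2.0):
    """Return (index, amount) pairs that lie far from the mean of all amounts."""
    mean = statistics.mean(amounts)
    spread = statistics.pstdev(amounts)  # population standard deviation
    if spread == 0:
        return []  # every amount is identical, so nothing stands out
    # Keep only the amounts whose distance from the mean exceeds the threshold.
    return [
        (index, amount)
        for index, amount in enumerate(amounts)
        if abs(amount - mean) / spread > threshold
    ]

# Hypothetical card transactions in dollars; one of them breaks the pattern.
amounts = [25, 30, 27, 22, 31, 28, 26, 950, 29, 24]
print(flag_anomalies(amounts))
```

The flagged transaction is then handed to a human for review, just like the “red guy” in the pattern.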

Let’s look at the next problem in this Data Science Tutorial. Don’t be scared, it deals with maths!

**How much or How many?**

Those of you, who don’t like maths, be relieved! Regression algorithms are here!

So, whenever there is a problem which may ask for figures or numerical values, we solve it using Regression Algorithms.

**For Example:**

What will be the temperature for tomorrow?

Since we expect a numeric value in the response to this problem, we will solve it using Regression Algorithms.
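As a small sketch, assume we have the temperatures of the last five days and that a straight line is a good enough fit (a big assumption for real weather!). Simple linear regression by least squares then looks like this. The numbers and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical temperatures (in degrees Celsius) for days 1 to 5.
days = [1, 2, 3, 4, 5]
temps = [20.1, 20.9, 22.2, 22.8, 24.0]

slope, intercept = fit_line(days, temps)
prediction = slope * 6 + intercept  # estimate for day 6, i.e. tomorrow
print(round(prediction, 2))
```

The output is a number on a continuous scale rather than a category, which is what separates regression from classification.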

Moving along in this Data Science Tutorial, let’s discuss the next algorithm,

**How is this organised?**

Say you have some data, but you don’t have any idea how to make sense of it. Hence the question: how is this organised?

Well, you can solve it using clustering algorithms. How do they solve these problems? Let’s see:

Clustering algorithms group the data by the characteristics its points have in common. For example, in the above diagram, the dots are organised based on colors. Similarly, for any data, clustering algorithms try to work out what the data points have in common and hence “cluster” them together.
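One well-known clustering algorithm is k-means, and a bare-bones sketch of it is shown below. The points are invented, and to keep the example predictable the first `k` points are used as the starting centroids. That is an assumption of this sketch, as real implementations usually pick starting points more carefully.

```python
import math

def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly moving centroids to cluster means."""
    # Assumption: start from the first k points so the result is deterministic.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins the cluster of its nearest centroid.
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return clusters, centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, k=2)
print(clusters)
```

Nobody told the algorithm which point belongs where. It only used how close the points are to each other.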

The next and final kind of problem in this Data Science Tutorial, that you may encounter is,

**What should I do next?**

Whenever you encounter a problem wherein your computer has to make a decision based on the training that you have given it, it involves Reinforcement Learning algorithms.

**For Example:**

Think of your temperature control system, which has to decide whether it should lower the temperature of the room or increase it.

**How do these algorithms work?**

These algorithms are inspired by human psychology. We like being appreciated, right? A computer running these algorithms is, in a sense, looking for appreciation while it is being trained. How? Let’s see.

Rather than teaching the computer what to do, you let it decide what to do, and at the end of that action, you give it either positive or negative feedback. Hence, rather than defining what is right and what is wrong in your system, you let your system “decide” what to do, and in the end give it feedback.

It’s just like training your dog. You cannot control what your dog does, right? But you can scold him when he does something wrong, and pat him on the back when he does what is expected.

Let’s apply this understanding to the example above. Imagine you are training the temperature control system, so whenever the number of people in the room increases, the system has to take an action: either lower the temperature or increase it. Since our system doesn’t understand anything yet, it takes a random decision. Let’s suppose it increases the temperature. You therefore give it negative feedback. With this, the computer learns that whenever the number of people in the room increases, it should not increase the temperature.

Similarly, you give feedback for other actions. With each piece of feedback your system learns a little more and hence becomes more accurate in its next decision. This type of learning is called Reinforcement Learning.
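The temperature example can be sketched as a tiny trial-and-feedback loop. Everything here is a simplification made up for illustration: there is only one situation (a crowded room), two actions, a hypothetical `feedback` function playing the trainer, and a basic running-average update instead of a full reinforcement learning algorithm.

```python
import random

ACTIONS = ["raise", "lower"]

def feedback(action):
    """Hypothetical trainer: in a crowded room, lowering the temperature is praised."""
    return 1.0 if action == "lower" else -1.0

def train(episodes=50, learning_rate=0.5, explore=0.2, seed=0):
    """Learn a value for each action purely from positive and negative feedback."""
    rng = random.Random(seed)
    value = {action: 0.0 for action in ACTIONS}
    for _ in range(episodes):
        if rng.random() < explore:
            action = rng.choice(ACTIONS)          # sometimes try something random
        else:
            action = max(ACTIONS, key=value.get)  # otherwise use the best known action
        reward = feedback(action)
        # Nudge the stored value of this action towards the feedback received.
        value[action] += learning_rate * (reward - value[action])
    return value

value = train()
best_action = max(ACTIONS, key=value.get)
print(best_action)
```

The system was never told “lower the temperature when the room is crowded”. It worked that out from the scolding and the pats on the back.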

Now, the algorithms that we learnt above in this Data Science Tutorial involve a common “learning practice”. We are making the machine learn, right?

**What is Machine Learning?**

It is a type of Artificial Intelligence that makes computers capable of learning on their own, i.e., without being explicitly programmed. With machine learning, machines can update their own model (the parameters they have learnt from data, rather than the program code itself) whenever they come across a new situation.

To conclude, we now know that Data Science is backed by Machine Learning and its algorithms for its analysis. But how do we do the analysis, and where do we do it? Data Science has some further components which help us answer these questions.

Before that, let me answer how MIT can predict the future, because I think you guys might be able to relate to it now. Researchers at MIT trained their model with movies, and the computers learnt how humans respond, or how they act just before doing an action.

For example, when you are about to shake hands with someone, you take your hand out of your pocket, or maybe lean in towards the person. Basically, there is a “pre-action” attached to everything we do. With the help of movies, the computer was trained on these “pre-actions”. And by observing more and more movies, their computers were then able to predict what the character’s next action could be.

Easy, ain’t it? Let me throw one more question at you then in this Data Science Tutorial! Which Machine Learning algorithm must they have implemented for this?

**Data Science Components**

**1. Datasets**

What will you analyze? Data, right? You need a lot of data that can be analyzed, and this data is fed to your algorithms or analytical tools. You can get this data from, for example, research conducted in the past.

**2. RStudio**

R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation. The R language is commonly used in an IDE called RStudio.

Why is it used?

**Programming and Statistical Language**- Apart from being used as a statistical language, it can also be used as a programming language for analytical purposes.

**Data Analysis and Visualization**- Apart from being one of the most dominant analytics tools, R is also one of the most popular tools used for data visualization.

**Simple and Easy to Learn**- R is simple and easy to learn, read and write.

**Free and Open Source**- R is an example of FLOSS (Free/Libre and Open Source Software), which means one can freely distribute copies of this software, read its source code, modify it, etc.

RStudio was sufficient for analysis until our datasets became huge and, at the same time, unstructured. This type of data came to be called Big Data.

**3. Big Data**

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Now, to tame this data, we needed a new kind of tool, because traditional software could not handle data like this. Hence we came up with Hadoop.

**4. Hadoop**

Hadoop is a framework which helps us to **store** and **process** large datasets in parallel and in a distributed fashion.

Let’s focus on the store and process part of Hadoop.

**Store**

The storage part in Hadoop is handled by HDFS, i.e., the Hadoop Distributed File System. It provides high availability across a distributed ecosystem. It works like this: it breaks the incoming data into chunks (blocks) and distributes them to different nodes in a cluster, allowing distributed storage.
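The chunk-and-distribute idea can be imitated in a few lines of plain Python. To be clear, this is a toy model and not HDFS code: the 8-byte block size, the node names, the round-robin placement and the two copies per block are all assumptions made for illustration. Real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the incoming data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replicas=2):
    """Assign every block index to `replicas` different nodes, round-robin."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        for copy in range(replicas):
            node = nodes[(index + copy) % len(nodes)]
            placement[node].append(index)
    return placement

# Hypothetical file contents and cluster.
data = b"user feedback about the new phone"
blocks = split_into_blocks(data, block_size=8)
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])

for node, held in placement.items():
    print(node, held)
```

Because each block lives on more than one node in this sketch, losing a single node does not lose any data, which is the intuition behind the high availability mentioned above.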

**Process**

MapReduce is the heart of Hadoop processing. It performs two important tasks, map and reduce. The job is broken into smaller tasks, which the mappers process in parallel. Once all the mappers have done their share of the work, their intermediate results are grouped together and then reduced to a simpler value by the Reduce process.
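The classic way to picture map and reduce is a word count. The sketch below runs on a single machine in plain Python, so it only mimics the flow (map, group, reduce) and not Hadoop’s actual API or its parallelism. The input lines and function names are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: collapse all the values for one key into a single count."""
    return key, sum(values)

# Hypothetical user feedback, one review per line.
lines = ["great camera great battery", "battery drains fast", "great screen"]

# In Hadoop each line could go to a different mapper; here we simply loop.
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each mapper only needs its own slice of the input, which is why the map step can be spread across many machines.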

If we use Hadoop as our storage in Data Science, it becomes difficult to process the input with RStudio, because plain R is not designed to work in a distributed environment. Hence we have SparkR.

**5. SparkR**

It is an R package that provides a lightweight way of using Apache Spark from R. Why would you use it over traditional R applications? Because it provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc., but on large datasets.
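If selection, filtering and aggregation sound abstract, here is what those three operations mean on a tiny table, written in plain Python. This is not the SparkR API and nothing here is distributed. It only illustrates the kind of operations a data frame supports, using invented phone ratings.

```python
from collections import defaultdict

# Hypothetical table of user ratings, one dictionary per row.
rows = [
    {"model": "A1", "region": "US", "rating": 4},
    {"model": "A1", "region": "IN", "rating": 5},
    {"model": "B2", "region": "US", "rating": 2},
    {"model": "B2", "region": "IN", "rating": 3},
]

# Selection: keep only the columns we care about.
selected = [{"model": row["model"], "rating": row["rating"]} for row in rows]

# Filtering: keep only the rows that satisfy a condition.
filtered = [row for row in selected if row["rating"] >= 3]

# Aggregation: group the remaining rows by model and average the ratings.
ratings_by_model = defaultdict(list)
for row in filtered:
    ratings_by_model[row["model"]].append(row["rating"])
average = {model: sum(r) / len(r) for model, r in ratings_by_model.items()}

print(average)
```

A distributed data frame offers the same kind of operations, except that the rows are spread across a cluster instead of sitting in one list.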

Take a breather now! We are done with the technical part of this Data Science Tutorial, so let’s look at it from your job perspective. I think you will have googled the salaries for a data scientist by now, but still, let’s discuss the job roles which are available to you as a data scientist.

**Data Scientist Job Roles**

Some of the prominent Data Scientist job titles are:

- Data Scientist
- Data Engineer
- Data Architect
- Data Administrator
- Data Analyst
- Business Analyst
- Data/Analytics Manager
- Business Intelligence Manager

The Payscale.com chart below shows the average Data Scientist salary by skills in the USA and India.

The time is ripe to up-skill in Data Science and Big Data Analytics to take advantage of the Data Science career opportunities that come your way. This brings us to the end of this Data Science Tutorial blog. I hope this blog was informative and added value for you. Now is the time to enter the Data Science world and become a successful Data Scientist.

This is the 2nd part of Data Science Tutorial For Beginners by Edureka