Intro to Data Science
What is Data Science? The only thing we know for sure is that Data Science is a broad term. Its meaning depends on the organization or company, the people involved, etc. This is a good representation:
Although this one might be a little bit more accurate:
As a "Data Scientist" you'll be combining your domain knowledge (the "science") with large amounts of data, using programming and computers. Your tasks as a Data Scientist will range from highly technical ones like Machine Learning to boring, repetitive operational ones like scraping data from a website and importing it into a database.
What does a Data Scientist do?
It's probably easier to define the tasks that are most common to Data Scientists than to give a general definition that applies to everybody.
_([Source from recommended article](https://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext))_
Getting the data
Depending on your company/organization, getting the data can be as simple as a SQL query or as difficult as scraping entire websites. The problem with this task is that it's not standardized.
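As a minimal sketch of the "simple SQL query" end of that spectrum, the example below uses Python's built-in `sqlite3` module. The in-memory database, the `sales` table and its columns are all hypothetical stand-ins for whatever data store your organization actually uses.

```python
import sqlite3

# Hypothetical in-memory database standing in for a company data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 42.0)],
)

# With the data already in a database, "getting the data" is a single query.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)
```

Scraping a website would need far more code than this, and the code would differ for every site, which is why this step is hard to standardize.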
Parsing and Cleaning the Data
Depending on your sources, you'll need to do a little bit of preparation: excluding outliers, filling null values, translating values, etc.
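The three preparation steps mentioned above could look roughly like this with pandas. The column names, the sample values and the 0 to 120 age range are assumptions made up for illustration, not rules that apply to every dataset.

```python
import pandas as pd

# Hypothetical raw data: one missing age, one implausible age,
# and inconsistent country labels.
raw = pd.DataFrame({
    "age": [25.0, 31.0, None, 29.0, 240.0],
    "country": ["US", "usa", "US", "DE", "de"],
})

# Excluding outliers: keep plausible ages, and keep missing ones
# so they can be filled in the next step.
clean = raw[raw["age"].isna() | raw["age"].between(0, 120)].copy()

# Filling null values: use the median of the remaining ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Translating values: map the different spellings to a single label.
clean["country"] = clean["country"].str.upper().replace({"USA": "US"})

print(clean)
```

Whether to drop, cap or keep an outlier, and what to fill nulls with, is a judgment call that depends on your domain.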
Merging, combining data
If the data comes from different sources, merging it can be tedious, especially if it's hard to identify the piece of information that relates the different data sources.
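Here is a small sketch of such a merge using pandas' `DataFrame.merge`. The two tables and their key columns (`customer_id` and `cust`) are hypothetical; in practice, finding out which columns play that role is usually the hard part.

```python
import pandas as pd

# Hypothetical sources: a CRM export and a billing system that
# name the shared customer key differently.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Bo", "Cy"]})
invoices = pd.DataFrame({"cust": [1, 1, 3, 4], "total": [10.0, 15.0, 7.5, 99.0]})

# An outer merge keeps rows from both sides; indicator=True adds a
# "_merge" column saying where each row came from.
merged = customers.merge(
    invoices,
    left_on="customer_id",
    right_on="cust",
    how="outer",
    indicator=True,
)

# Rows that did not find a partner in the other source deserve a closer look.
unmatched = merged[merged["_merge"] != "both"]

print(merged)
print(unmatched)
```

Checking the unmatched rows is a cheap way to spot keys that don't line up between sources.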
Doing the analysis
This involves your own domain expertise plus the tools available for the job. For example, you need to know the principles of statistics, and you can also use statsmodels to simplify your job. The analysis part is usually iterative and involves other subtasks such as visualization, validation, testing, etc.
The whole point of the analysis part is finding patterns in particular cases to build general models. Your models can be predictions, clusterings, or just automated reports. In a general sense, it's the result of all the previous phases.
Perfect analyses and models are useless if they're not repeatable and scalable. How you deploy them depends on your own models, but it usually involves a cloud provider: anything from simple reports (emailed every day at 3 AM from an AWS Lambda) to a Machine Learning model (built on top of TensorFlow and deployed on Google's cloud infrastructure).
The only thing that's certain
We're sorry we can't give you better answers to the question "What is Data Science?", but this is a new discipline and we're putting together definitions, scopes and responsibilities of different actors. But there's one thing that's certain: You need to know how to code...