How to start learning data science using Python

If you are completely new to data science, you may be wondering what data science is and what a data science career path looks like for beginners. If so, read that post first.

Before you jump into a data science project, make sure you have a good understanding of Python coding, especially working with databases, variables and data types, arrays, loops, dictionary objects, classes and objects, and NumPy arrays.
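
As a quick self-check, the short sketch below touches most of the prerequisites listed above in one place. The names (`Dataset`, `summary`, and so on) are made up for illustration only; if every line reads naturally to you, you are ready to continue.

```python
import numpy as np

# Variables and basic data types
name = "sample"                          # str
values = [17, 19, 11, 21, 23, 46, 29]    # list of int

# Loop: accumulate a running total
total = 0
for v in values:
    total += v

# Dictionary: map labels to values
summary = {"name": name, "count": len(values), "total": total}

# Class and object: a tiny container with one method (hypothetical example class)
class Dataset:
    def __init__(self, name, values):
        self.name = name
        self.values = values

    def mean(self):
        return sum(self.values) / len(self.values)

ds = Dataset(name, values)

# NumPy array: arithmetic is applied to the whole array at once
arr = np.array(values)
centered = arr - arr.mean()

print(summary)
print(ds.mean(), arr.mean())
print(centered.round(2))
```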

I have separated each component and tried to simplify it for beginners who want to learn data science using Python code. Each small task will help you understand the process step by step, so you can learn on your own.

Data Science Library

Working through the data science lifecycle requires many different modules and libraries. Let us go through the following core Python libraries and how to use them.

Ready to start with dataset

If you have completed all the tasks above, you are now familiar with the libraries and objects we work with during a data science project. So let’s start with a small dataset exercise; you can download the MoMA data from GitHub.

import numpy as np
import pandas as pd

print("Welcome to the MoMA dataset")

artists = pd.read_csv('../Python-VSCode/testdata/artists.csv')
print(artists)

There are two datasets, Artists and Artworks, each with around 15k records, which is a good size to play with.
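
Once the file is loaded, a few pandas calls give a first impression of the data. The sketch below uses a tiny in-memory stand-in instead of the real `artists.csv`, and the column names are assumptions for illustration; check the actual header of the file you downloaded.

```python
import io
import pandas as pd

# Hypothetical stand-in for artists.csv; the real file has many more rows,
# and its column names may differ from the ones assumed here.
sample_csv = io.StringIO(
    "ConstituentID,DisplayName,Nationality,Gender,BeginDate\n"
    "1,Artist A,American,Male,1930\n"
    "2,Artist B,French,Female,1941\n"
    "3,Artist C,American,Female,1955\n"
    "4,Artist D,,Male,0\n"
)
artists = pd.read_csv(sample_csv)

print(artists.shape)          # (rows, columns)
print(artists.head(3))        # first three rows
print(artists.dtypes)         # column types inferred by pandas
print(artists.isna().sum())   # missing values per column

# Count artists per nationality; missing values are ignored by default
by_nationality = artists["Nationality"].value_counts()
print(by_nationality)
```

With the real file, replace `sample_csv` with the path used in the code above.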

To simplify our understanding of data science, we classify the tutorials into three categories: first, reading data from different data sources such as Excel, XML, RDBMS and JSON; then data analysis; and finally data visualization.

Data Processing

Data processing is the process of extracting data from various data sources, where the data are in different formats. We need to write code to extract the data and fit it into our standard format so that it becomes easy to analyse.
- Read XML data
- JSON data
- Excel data
- CSV data
- HTML data
- MySQL data
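
To make the idea of a "standard format" concrete, here is a minimal sketch that reads the same kind of record from CSV, JSON and XML and combines everything into one pandas DataFrame. The sample strings and the `name`/`price` columns are invented for illustration; real sources would be files or database tables.

```python
import io
import json
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical in-memory samples standing in for real files
csv_text = "name,price\napple,1.2\nbanana,0.5\n"
json_text = '[{"name": "cherry", "price": 3.0}]'
xml_text = "<items><item><name>date</name><price>2.4</price></item></items>"

# CSV: pandas parses it directly
csv_df = pd.read_csv(io.StringIO(csv_text))

# JSON: decode to a list of dicts, then build a DataFrame
json_df = pd.DataFrame(json.loads(json_text))

# XML: walk the tree and collect one dict per <item>
root = ET.fromstring(xml_text)
xml_rows = [
    {"name": item.findtext("name"), "price": float(item.findtext("price"))}
    for item in root.findall("item")
]
xml_df = pd.DataFrame(xml_rows)

# Combine everything into one table with the same columns and types
combined = pd.concat([csv_df, json_df, xml_df], ignore_index=True)
combined["source"] = ["csv", "csv", "json", "xml"]
print(combined)
```

pandas also provides `read_json`, `read_excel`, `read_xml`, `read_html` and `read_sql` for the formats listed above; some of them need extra packages installed, so the standard-library parsers are used here to keep the sketch dependency-light.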

Data Analysis

Data analysis means examining the data to understand the relationships among different variables, based on usability and business requirements. The analysed data should help stakeholders make business decisions.

Measuring Data Variance

import statistics
dataset = [17, 19, 11, 21, 23, 46, 29]
output = statistics.variance(dataset) 
print(output)
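
To see what `statistics.variance` actually computes, the sketch below repeats the calculation by hand on the same dataset. It shows that `variance` is the sample variance (dividing by n - 1), while `pvariance` is the population variance (dividing by n).

```python
import math
import statistics

dataset = [17, 19, 11, 21, 23, 46, 29]

mean = sum(dataset) / len(dataset)
squared_deviations = [(x - mean) ** 2 for x in dataset]

# Sample variance divides by n - 1; this is what statistics.variance returns
sample_variance = sum(squared_deviations) / (len(dataset) - 1)

# Population variance divides by n; this matches statistics.pvariance
population_variance = sum(squared_deviations) / len(dataset)

print(sample_variance, statistics.variance(dataset))
print(population_variance, statistics.pvariance(dataset))

# The standard deviation is the square root of the variance
print(math.sqrt(sample_variance), statistics.stdev(dataset))
```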

Normal and Binomial Distribution
Poisson and Bernoulli Distribution
Data Correlation

Linear Regression

import pandas as pd
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

data = {'year': [2021,2021,2021,2021,2021,2021,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2023,2023,2023,2023,2023,2023,2023,2023],
        'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
        'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
        'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
        'index_price': [1574,1257,1432,1303,1256,1754,1804,1175,1201,1189,1130,1075,1047,915,933,958,971,949,874,882,876,802,804,785]        
        }

df = pd.DataFrame(data)

Simple Regression

With a single input variable, the predicted value changes as that one variable changes.

model = LinearRegression()

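X = df['unemployment_rate'].values.reshape(-1, 1)  # 2D input, shape (n_samples, 1)
Y = df['index_price'].values.reshape(-1, 1)  # 2D target, shape (n_samples, 1)
lr = model.fit(X, Y)  # fit() returns the fitted model itself
predictions = lr.predict(X)  # predicted index price for each row
print(predictions)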

Multiple Regression
Multiple regression is like simple linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.

For example, the index price changes based on the interest rate and the unemployment rate. Here index_price is the dependent variable, and the two independent (input) variables are interest_rate and unemployment_rate.
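
The example just described can be sketched as follows, using the same three columns as the DataFrame defined earlier. The values passed to `predict` at the end are a hypothetical month chosen only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same three columns as the tutorial's DataFrame
data = {
    'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
    'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
    'index_price': [1574,1257,1432,1303,1256,1754,1804,1175,1201,1189,1130,1075,1047,915,933,958,971,949,874,882,876,802,804,785],
}
df = pd.DataFrame(data)

# Two independent variables (a 2D table) and one dependent variable
X = df[['interest_rate', 'unemployment_rate']]
y = df['index_price']

model = LinearRegression().fit(X, y)

print('Intercept:', model.intercept_)
print('Coefficients:', dict(zip(X.columns, model.coef_)))

# Predict the index price for a hypothetical month (values are made up)
new_month = pd.DataFrame({'interest_rate': [2.0], 'unemployment_rate': [5.8]})
predicted = model.predict(new_month)
print('Predicted index price:', predicted[0])
```

Each coefficient estimates how much the index price changes when that variable increases by one unit while the other is held fixed.
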
Logistic Regression

Data P-Value

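A p-value is a probability value: it tells us how likely it is to see a result at least as extreme as the one in our data if there were no real relationship between the variables. If probability is new to you, check this site to learn probability, or check this free course on probability.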

from scipy import stats

# LinearRegression does not report p-values, so use scipy.stats.linregress,
# which fits the same straight line and also tests whether the slope is zero.
result = stats.linregress(df['unemployment_rate'], df['index_price'])
p_value = result.pvalue  # a small value (e.g. below 0.05) suggests a real relationship
print(p_value)

Data Visualization

In the data visualization process, we create graphical representations of the data that are easy to understand during a presentation, using different types of charts, colours and so on.
- Chart Properties and Styling
  The Matplotlib library in Python is used for creating charts and controlling their properties and styling.
- Plot and Scatter Plots
  We can use either of two methods to display the data: pyplot.plot(df['interest_rate'], df['index_price']) draws a line, and pyplot.scatter(df['unemployment_rate'], df['index_price'], color='green') draws individual points.
```
from matplotlib import pyplot

pyplot.plot(df['interest_rate'], df['index_price']) #OR
#pyplot.scatter(df['unemployment_rate'], df['index_price'], color='green')
pyplot.title('Index Price Vs Interest Rate', fontsize=14)
pyplot.xlabel('Interest Rate', fontsize=14)
pyplot.ylabel('Index Price', fontsize=14)
pyplot.grid(True)
pyplot.show()
```
- Python Heat Maps
- Python 3D Charts
- Geographical Data and Time Series
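
The styling and scatter plot items above can be combined in one small sketch: a scatter plot of index price against unemployment rate, with a fitted straight line drawn on top. Several choices here are assumptions for illustration: the `Agg` backend and the output file name (so the chart is saved to a file instead of opening a window), and `np.polyfit` as a quick way to get the line.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file, no display window needed
from matplotlib import pyplot
import numpy as np

# Two columns from the tutorial's DataFrame
unemployment_rate = np.array([5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1])
index_price = np.array([1574,1257,1432,1303,1256,1754,1804,1175,1201,1189,1130,1075,1047,915,933,958,971,949,874,882,876,802,804,785])

# Least-squares straight line through the points (degree-1 polynomial)
slope, intercept = np.polyfit(unemployment_rate, index_price, 1)

fig, ax = pyplot.subplots(figsize=(6, 4))
ax.scatter(unemployment_rate, index_price, color="green", label="observed")
xs = np.linspace(unemployment_rate.min(), unemployment_rate.max(), 50)
ax.plot(xs, slope * xs + intercept, color="black", linestyle="--", label="linear fit")
ax.set_title("Index Price vs Unemployment Rate", fontsize=14)
ax.set_xlabel("Unemployment Rate", fontsize=12)
ax.set_ylabel("Index Price", fontsize=12)
ax.grid(True)
ax.legend()

# Hypothetical output file name
fig.savefig("index_vs_unemployment.png", dpi=100)
```

In an interactive session you can drop the `matplotlib.use("Agg")` line and call `pyplot.show()` instead of saving the figure.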