If you are completely new to data science, you may be wondering what data science is and what a data science career path for beginners looks like. Read that post first.
Before you jump into a data science project, make sure you have a good understanding of Python coding, especially working with databases, Python variables and data types, Python arrays, loops, Python dictionary objects, classes and objects, and NumPy arrays.
I have separated each component and tried to simplify it for beginners who want to learn data science using Python code. Each small task will help you understand the process step by step, so you can learn on your own without help.
Working through the data science lifecycle requires many different modules and libraries. Let us look at the following core Python libraries and how to use them.
If you have completed all of the above tasks successfully, you are now familiar with the libraries and objects we work with during a data science project. Let’s start with a small dataset exercise; you can download the MoMA data from GitHub.
import numpy as np
import pandas as pd

print("Welcome to the MoMA dataset")

# Load the artists file; adjust the relative path to wherever you saved the download
artists = pd.read_csv('../Python-VSCode/testdata/artists.csv')
print(artists)
There are two datasets, Artists and Artworks, each with around 15k records, which makes them good to play with!
To simplify our understanding of data science, we classify our tutorials into three categories: reading data from different data sources such as Excel, XML, RDBMS and JSON, then data analysis, and finally data visualization.
Data processing is the process of extracting data from various data sources where the data is stored in different formats. We need to write code to extract that data and fit it into our standard format so that it becomes easy to analyse.
Data analysis means examining the data to understand the relationships within it, guided by usability and business requirements. The analysed data should help stakeholders make business decisions.
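As a rough illustration of fitting differently formatted sources into one standard format, here is a minimal sketch using pandas. The sample records, the column names (`name`, `nationality`, `full_name`, `country`) and the in-memory strings are all hypothetical stand-ins for real files.

```python
import io

import pandas as pd

# Hypothetical sample data: the same kind of record arriving in two formats.
csv_text = "name,nationality\nAda,British\nLin,Chinese\n"
json_text = '[{"full_name": "Omar", "country": "Egyptian"}]'

# Read each source with the matching pandas reader.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

# Rename the JSON columns so both frames share one standard schema.
from_json = from_json.rename(columns={"full_name": "name", "country": "nationality"})

# Stack the two sources into a single table that is ready for analysis.
combined = pd.concat([from_csv, from_json], ignore_index=True)
print(combined)
```

With real files you would pass a file path to `pd.read_csv` or `pd.read_json` instead of an in-memory string; the renaming and concatenation steps stay the same.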
import statistics

dataset = [17, 19, 11, 21, 23, 46, 29]

# statistics.variance returns the sample variance of the values
output = statistics.variance(dataset)
print(output)
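To make it clearer what the variance call computes, the sketch below works out the same sample variance by hand for the same dataset and compares it with the library result. The variable names are only illustrative.

```python
import statistics

dataset = [17, 19, 11, 21, 23, 46, 29]

# Step 1: the mean of the values.
mean = statistics.mean(dataset)

# Step 2: squared distance of every value from the mean.
squared_deviations = [(x - mean) ** 2 for x in dataset]

# Step 3: sample variance divides by n - 1, not n.
manual_variance = sum(squared_deviations) / (len(dataset) - 1)

print(manual_variance)
print(statistics.variance(dataset))

# The standard deviation is the square root of the variance.
print(statistics.stdev(dataset))
```

The first two printed values should agree up to floating-point rounding (about 126.9 for this dataset).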
import pandas as pd
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

data = {
    'year': [2021,2021,2021,2021,2021,2021,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2023,2023,2023,2023,2023,2023,2023,2023],
    'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
    'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
    'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
    'index_price': [1574,1257,1432,1303,1256,1754,1804,1175,1201,1189,1130,1075,1047,915,933,958,971,949,874,882,876,802,804,785]
}

df = pd.DataFrame(data)
With simple linear regression, the predicted value changes based on one input variable.
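model = LinearRegression()

# scikit-learn expects 2D arrays, so reshape each column to (n_samples, 1)
X = df['unemployment_rate'].values.reshape(-1, 1)
Y = df['index_price'].values.reshape(-1, 1)

lr = model.fit(X, Y)
_p = lr.predict(X)  # predicted index price for each row
print(_p)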
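To see what a fitted line actually consists of, here is a minimal sketch on toy data. The numbers are hypothetical and chosen so that y is exactly 2 * x + 1, which makes the fitted slope and intercept easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): y is exactly 2 * x + 1.
X = np.array([1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)  # 2D: one column per input variable
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # change in y for each unit change in x
intercept = model.intercept_  # value of y when x is 0

# Predict y for a new input that was not in the training data.
prediction = model.predict(np.array([[5.0]]))[0]
print(slope, intercept, prediction)
```

The same `coef_` and `intercept_` attributes are available on any fitted `LinearRegression` model, so you can inspect them after fitting on your own data.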
Multiple regression is like linear regression, but with more than one independent variable, meaning that we try to predict a value based on two or more variables.
For example, the index price changes based on the interest rate and the unemployment rate, where index_price is the dependent variable and the two independent (input) variables are interest_rate and unemployment_rate.
The p-value is a probability value: the probability of seeing a result at least as extreme as the one observed if there were no real relationship between the variables. Check this site to learn probability, or check this for a free course on probability.
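A minimal sketch of a regression with two input variables is shown below. The column names `x1`, `x2` and `y` and the toy values are hypothetical; y is built as exactly 1 + 2 * x1 + 3 * x2 so the fitted coefficients are easy to verify.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): y = 1 + 2 * x1 + 3 * x2 for every row.
toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [1, 1, 2, 3, 5],
    "y": [6, 8, 13, 18, 26],
})

# Selecting two columns gives a 2D input with one column per variable.
X = toy[["x1", "x2"]]
y = toy["y"]

model = LinearRegression().fit(X, y)

# One coefficient per input variable, plus a single intercept.
coef_x1, coef_x2 = model.coef_
intercept = model.intercept_
print(coef_x1, coef_x2, intercept)

# Predict for a new combination of the two inputs.
new_row = pd.DataFrame({"x1": [6], "x2": [4]})
prediction = model.predict(new_row)[0]
print(prediction)
```

Each coefficient describes how much the prediction changes when that variable increases by one unit while the other variable stays fixed.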
model = LinearRegression()

# Two input columns, so X already has shape (n_samples, 2)
X = df[['interest_rate', 'unemployment_rate']]
Y = df['index_price']

lr = model.fit(X, Y)
# predict() returns the predicted index prices, not p-values
predicted = lr.predict(X)
print(predicted)
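scikit-learn's `LinearRegression` does not report p-values. One possible way to get one for a single input variable is `scipy.stats.linregress`, sketched below on the unemployment rate and index price values from the dataset above. Using SciPy here is an assumption of this sketch, since the original code does not import it.

```python
from scipy import stats

# Values copied from the unemployment_rate and index_price columns above.
unemployment_rate = [5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                     6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1]
index_price = [1574, 1257, 1432, 1303, 1256, 1754, 1804, 1175, 1201, 1189, 1130, 1075,
               1047, 915, 933, 958, 971, 949, 874, 882, 876, 802, 804, 785]

# Fit a straight line and test the null hypothesis that the slope is zero.
result = stats.linregress(unemployment_rate, index_price)

print("slope:", result.slope)
print("intercept:", result.intercept)
print("p-value:", result.pvalue)
```

A small p-value (commonly below 0.05) suggests that the relationship between the two variables is unlikely to be due to chance alone.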
In the data visualization process, we create graphical representations of the data that are easy to understand during a presentation, using different types of charts, colours and so on.
The Matplotlib library in Python is used for creating charts and other graphical representations of data.
We can use either of these two methods to display the data. As a line:
pyplot.plot(df['interest_rate'], df['index_price'])
or as dots:
pyplot.scatter(df['unemployment_rate'], df['index_price'], color='green')
from matplotlib import pyplot

pyplot.plot(df['interest_rate'], df['index_price'])
# OR
# pyplot.scatter(df['unemployment_rate'], df['index_price'], color='green')

pyplot.title('Index Price Vs Interest Rate', fontsize=14)
pyplot.xlabel('Interest Rate', fontsize=14)
pyplot.ylabel('Index Price', fontsize=14)
pyplot.grid(True)
pyplot.show()
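The dotted form can also be combined with a fitted straight line to make the trend easier to see. The sketch below is one way to do that, assuming `numpy.polyfit` for the line (instead of scikit-learn), a non-interactive backend, and a hypothetical output file name.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs without a display

import numpy as np
from matplotlib import pyplot

# Values copied from the unemployment_rate and index_price columns above.
unemployment_rate = np.array([5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                              6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1])
index_price = np.array([1574, 1257, 1432, 1303, 1256, 1754, 1804, 1175, 1201, 1189, 1130, 1075,
                        1047, 915, 933, 958, 971, 949, 874, 882, 876, 802, 804, 785])

# Fit a degree-1 polynomial (a straight line) through the points.
slope, intercept = np.polyfit(unemployment_rate, index_price, 1)

fig, ax = pyplot.subplots()
ax.scatter(unemployment_rate, index_price, color='green', label='observed')

# Draw the fitted line across the observed range of x values.
xs = np.linspace(unemployment_rate.min(), unemployment_rate.max(), 50)
ax.plot(xs, slope * xs + intercept, color='black', label='fitted line')

ax.set_title('Index Price Vs Unemployment Rate', fontsize=14)
ax.set_xlabel('Unemployment Rate', fontsize=14)
ax.set_ylabel('Index Price', fontsize=14)
ax.grid(True)
ax.legend()

# Hypothetical file name; use pyplot.show() instead when working interactively.
fig.savefig('index_price_vs_unemployment.png')
```

Saving to a file is convenient for reports and presentations; in an interactive session you can drop the `matplotlib.use("Agg")` line and call `pyplot.show()` as in the example above.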