DOWNLOAD DATASETS¶
To download the datasets used in this tutorial, pleas see the following links
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv
RESHAPE DATA FROM WIDE TO LONG¶
To reshape data from wide to a long format, we can use the melt() method. The melt function accepts several arguments, however, the most frequently one is the id_vars, that is to specify the variable that will not be touched or melted to a long format.
# Change folder
cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Part 2 - Tidy Data"
import pandas as pd
pew = pd.read_csv('pew.csv')
pew.head()
In the above dataset, we a column named religion and 10 other columns, containing the income levels of the individuals. We would like to melt the other 10 columns to a long format but do not want to touch or melt the religion column. Therefore, the id_vars will be equal to 'religion'
reshaped = pew.melt(id_vars='religion')
reshaped.head()
MULTIPLE ID_VARS¶
When using more than one id_vars (i.e. keeping these variables as is, not melting them), we shall pass a python's list that will contain the names of the variables. Python lists are enclosed in squared brackets, each item is separated by a comma
# Impoprt the billboard.csv file
billboard = pd.read_csv('billboard.csv')
billboard.head()
# Melt the week variables and keep year artist.inversted track time genre date.entered data.peaked
reshaped = billboard.melt(['year','artist.inverted','track','time','genre','date.entered','date.peaked'])
reshaped.head()
Question: What is the average of value column for each artist in the reshaped data¶
avg_by_artist = reshaped.groupby('artist.inverted')['value'].mean()
avg_by_artist.head()
TIP: Breaking Code on Multiple Lines¶
We can break code on multiple lines by wrapping the code in bracket. See the above code again, with brackets and on three lines
avg_by_artist = (reshaped
.groupby('artist.inverted')['value']
.mean())
MULTIPLE VARIABLES STORED IN ONE COLUMN¶
# load the ebola.csv file
ebola = pd.read_csv('ebola.csv')
ebola.head()
# Use Date and Day as id_vars and melt the rest.
ebola_melt = ebola.melt(id_vars=['Date', 'Day'])
ebola_melt.head()
Rename the newely created variables as case_country and deaths¶
ebola_melt = ebola.melt(
id_vars=['Date', 'Day'],
var_name ='case_country',
value_name = 'deaths')
ebola_melt.head()
SPLIT STRING VARIABLE¶
Pandas offer a split method in the str Accessor Methods. This can be found in the Pandas Documentation> API reference > Series > Accessors > String handling : Web link here https://pandas.pydata.org/pandas-docs/version/0.25/reference/series.html#string-handling
split = ebola_melt['case_country'].str.split('_', expand=True)
split.head()
Explanation¶
ebola_melt['case_country'] specifies the column to be split
.str.split() = Uses the split function from the str group of methods
'_' tells the split function which character is used for splitting the text
expand = True : specifies that the split text should be written as new variables
CREATE COLUMNS IN DATAFRAME¶
Column can be added to a dataframe using constants or expressions or by appending existing dataframes. In the following examples, we shall first add a column with a constant value of 100, let us call this new column as test100
# Example 1: Add a constant variable
ebola_melt['test100'] = 100
ebola_melt.head()
#Example 2: Add avarible using an expression. Divide deaths on 2
ebola_melt['death2'] = ebola_melt['deaths'] / 2
ebola_melt.head()
# Example 3: Append the split dataframe to ebola_melt dataframe
ebola_melt[['cases', 'country']] = split
# Why we used two square brackets in the above code?
ebola_melt.head()
Delete a Column?¶
Say we want to drop column test100
testdrop = ebola_melt.drop('test100', axis = 'columns')
testdrop.head()
RESHAPE FROM LONG TO WIDE¶
This is also called variables are stored in both rows and columns