Lecture 1 - Part 2 - Tidy Data

DOWNLOAD DATASETS

To download the datasets used in this tutorial, pleas see the following links
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

RESHAPE DATA FROM WIDE TO LONG

To reshape data from wide to a long format, we can use the melt() method. The melt function accepts several arguments, however, the most frequently one is the id_vars, that is to specify the variable that will not be touched or melted to a long format.

In [4]:
# Change folder
In [5]:
cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Part 2 - Tidy Data"
D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Part 2 - Tidy Data
In [1]:
import pandas as pd
In [6]:
pew = pd.read_csv('pew.csv')
In [7]:
pew.head()
Out[7]:
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k Don't know/refused
0 Agnostic 27 34 60 81 76 137 122 109 84 96
1 Atheist 12 27 37 52 35 70 73 59 74 76
2 Buddhist 27 21 30 34 33 58 62 39 53 54
3 Catholic 418 617 732 670 638 1116 949 792 633 1489
4 Don’t know/refused 15 14 15 11 10 35 21 17 18 116

In the above dataset, we a column named religion and 10 other columns, containing the income levels of the individuals. We would like to melt the other 10 columns to a long format but do not want to touch or melt the religion column. Therefore, the id_vars will be equal to 'religion'

In [9]:
reshaped = pew.melt(id_vars='religion')
In [10]:
reshaped.head()
Out[10]:
religion variable value
0 Agnostic <$10k 27
1 Atheist <$10k 12
2 Buddhist <$10k 27
3 Catholic <$10k 418
4 Don’t know/refused <$10k 15

MULTIPLE ID_VARS

When using more than one id_vars (i.e. keeping these variables as is, not melting them), we shall pass a python's list that will contain the names of the variables. Python lists are enclosed in squared brackets, each item is separated by a comma

In [11]:
# Impoprt the billboard.csv file
billboard = pd.read_csv('billboard.csv')
In [12]:
billboard.head()
Out[12]:
year artist.inverted track time genre date.entered date.peaked x1st.week x2nd.week x3rd.week ... x67th.week x68th.week x69th.week x70th.week x71st.week x72nd.week x73rd.week x74th.week x75th.week x76th.week
0 2000 Destiny's Child Independent Women Part I 3:38 Rock 2000-09-23 2000-11-18 78 63.0 49.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2000 Santana Maria, Maria 4:18 Rock 2000-02-12 2000-04-08 15 8.0 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2000 Savage Garden I Knew I Loved You 4:07 Rock 1999-10-23 2000-01-29 71 48.0 43.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2000 Madonna Music 3:45 Rock 2000-08-12 2000-09-16 41 23.0 18.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2000 Aguilera, Christina Come On Over Baby (All I Want Is You) 3:38 Rock 2000-08-05 2000-10-14 57 47.0 45.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 83 columns

In [13]:
# Melt the week variables and keep year artist.inversted track time genre date.entered data.peaked
In [15]:
reshaped = billboard.melt(['year','artist.inverted','track','time','genre','date.entered','date.peaked'])
In [17]:
reshaped.head()
Out[17]:
year artist.inverted track time genre date.entered date.peaked variable value
0 2000 Destiny's Child Independent Women Part I 3:38 Rock 2000-09-23 2000-11-18 x1st.week 78.0
1 2000 Santana Maria, Maria 4:18 Rock 2000-02-12 2000-04-08 x1st.week 15.0
2 2000 Savage Garden I Knew I Loved You 4:07 Rock 1999-10-23 2000-01-29 x1st.week 71.0
3 2000 Madonna Music 3:45 Rock 2000-08-12 2000-09-16 x1st.week 41.0
4 2000 Aguilera, Christina Come On Over Baby (All I Want Is You) 3:38 Rock 2000-08-05 2000-10-14 x1st.week 57.0

Question: What is the average of value column for each artist in the reshaped data

In [20]:
avg_by_artist = reshaped.groupby('artist.inverted')['value'].mean()
In [23]:
avg_by_artist.head()
Out[23]:
artist.inverted
2 Pac           85.428571
2Ge+her         90.000000
3 Doors Down    37.602740
504 Boyz        56.222222
98�             37.650000
Name: value, dtype: float64

TIP: Breaking Code on Multiple Lines

We can break code on multiple lines by wrapping the code in bracket. See the above code again, with brackets and on three lines

In [32]:
avg_by_artist = (reshaped
                 .groupby('artist.inverted')['value']
                 .mean())

MULTIPLE VARIABLES STORED IN ONE COLUMN

In [34]:
# load the ebola.csv file
ebola = pd.read_csv('ebola.csv')
In [35]:
ebola.head()
Out[35]:
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali
0 1/5/2015 289 2776.0 NaN 10030.0 NaN NaN NaN NaN NaN 1786.0 NaN 2977.0 NaN NaN NaN NaN NaN
1 1/4/2015 288 2775.0 NaN 9780.0 NaN NaN NaN NaN NaN 1781.0 NaN 2943.0 NaN NaN NaN NaN NaN
2 1/3/2015 287 2769.0 8166.0 9722.0 NaN NaN NaN NaN NaN 1767.0 3496.0 2915.0 NaN NaN NaN NaN NaN
3 1/2/2015 286 NaN 8157.0 NaN NaN NaN NaN NaN NaN NaN 3496.0 NaN NaN NaN NaN NaN NaN
4 12/31/2014 284 2730.0 8115.0 9633.0 NaN NaN NaN NaN NaN 1739.0 3471.0 2827.0 NaN NaN NaN NaN NaN
In [38]:
# Use Date and Day as id_vars and melt the rest.
ebola_melt = ebola.melt(id_vars=['Date', 'Day'])
In [39]:
ebola_melt.head()
Out[39]:
Date Day variable value
0 1/5/2015 289 Cases_Guinea 2776.0
1 1/4/2015 288 Cases_Guinea 2775.0
2 1/3/2015 287 Cases_Guinea 2769.0
3 1/2/2015 286 Cases_Guinea NaN
4 12/31/2014 284 Cases_Guinea 2730.0

Rename the newely created variables as case_country and deaths

In [41]:
ebola_melt = ebola.melt(
    id_vars=['Date', 'Day'],
    var_name ='case_country',
    value_name = 'deaths')
In [43]:
ebola_melt.head()
Out[43]:
Date Day case_country deaths
0 1/5/2015 289 Cases_Guinea 2776.0
1 1/4/2015 288 Cases_Guinea 2775.0
2 1/3/2015 287 Cases_Guinea 2769.0
3 1/2/2015 286 Cases_Guinea NaN
4 12/31/2014 284 Cases_Guinea 2730.0

SPLIT STRING VARIABLE

Pandas offer a split method in the str Accessor Methods. This can be found in the Pandas Documentation> API reference > Series > Accessors > String handling : Web link here https://pandas.pydata.org/pandas-docs/version/0.25/reference/series.html#string-handling

In [46]:
split = ebola_melt['case_country'].str.split('_', expand=True)
In [47]:
split.head()
Out[47]:
0 1
0 Cases Guinea
1 Cases Guinea
2 Cases Guinea
3 Cases Guinea
4 Cases Guinea

Explanation

ebola_melt['case_country'] specifies the column to be split
.str.split() = Uses the split function from the str group of methods
'_' tells the split function which character is used for splitting the text
expand = True : specifies that the split text should be written as new variables

CREATE COLUMNS IN DATAFRAME

Column can be added to a dataframe using constants or expressions or by appending existing dataframes. In the following examples, we shall first add a column with a constant value of 100, let us call this new column as test100

In [48]:
# Example 1: Add a constant variable
ebola_melt['test100'] = 100
In [49]:
ebola_melt.head()
Out[49]:
Date Day case_country deaths test100
0 1/5/2015 289 Cases_Guinea 2776.0 100
1 1/4/2015 288 Cases_Guinea 2775.0 100
2 1/3/2015 287 Cases_Guinea 2769.0 100
3 1/2/2015 286 Cases_Guinea NaN 100
4 12/31/2014 284 Cases_Guinea 2730.0 100
In [51]:
#Example 2: Add avarible using an expression. Divide deaths on 2
ebola_melt['death2'] = ebola_melt['deaths'] / 2
In [52]:
ebola_melt.head()
Out[52]:
Date Day case_country deaths test100 death2
0 1/5/2015 289 Cases_Guinea 2776.0 100 1388.0
1 1/4/2015 288 Cases_Guinea 2775.0 100 1387.5
2 1/3/2015 287 Cases_Guinea 2769.0 100 1384.5
3 1/2/2015 286 Cases_Guinea NaN 100 NaN
4 12/31/2014 284 Cases_Guinea 2730.0 100 1365.0
In [58]:
# Example 3: Append the split dataframe to ebola_melt dataframe
ebola_melt[['cases', 'country']] = split
In [63]:
# Why we used two square brackets in the above code?
ebola_melt.head()
Out[63]:
Date Day case_country deaths test100 death2 cases country
0 1/5/2015 289 Cases_Guinea 2776.0 100 1388.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 100 1387.5 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 100 1384.5 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN 100 NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 100 1365.0 Cases Guinea

Delete a Column?

Say we want to drop column test100

In [61]:
testdrop = ebola_melt.drop('test100', axis = 'columns')
In [62]:
testdrop.head()
Out[62]:
Date Day case_country deaths death2 cases country
0 1/5/2015 289 Cases_Guinea 2776.0 1388.0 Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 1387.5 Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 1384.5 Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN NaN Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 1365.0 Cases Guinea

RESHAPE FROM LONG TO WIDE

This is also called variables are stored in both rows and columns

In [ ]: