DOWNLOAD DATASETS¶

To download the datasets used in this tutorial, please see the following links:
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

RESHAPE DATA FROM WIDE TO LONG¶

To reshape data from wide to long format, we can use the melt() method. melt() accepts several arguments; the most frequently used one is id_vars, which specifies the variable(s) that should be kept as they are and not melted into long format.

# Change folder (IPython cd magic; adjust the path to where you saved the datasets)

cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Part 2 - Tidy Data"

D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Part 2 - Tidy Data

import pandas as pd

pew = pd.read_csv('pew.csv')

pew.head()

In the above dataset, we have a column named religion and 10 other columns, one per income level. We would like to melt the other 10 columns to a long format but leave the religion column untouched. Therefore, id_vars will be equal to 'religion'.

reshaped = pew.melt(id_vars='religion')

reshaped.head()
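If pew.csv is not at hand, the same idea can be tried on a small hand-built dataframe. The following is a minimal sketch; the dataframe name toy and its two income columns are assumptions made for illustration (the numbers are the first two rows of the pew output).

```python
import pandas as pd

# Hypothetical toy data shaped like pew.csv:
# one identifier column plus one column per income level
toy = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist'],
    '<$10k': [27, 12],
    '$10-20k': [34, 27],
})

# Keep 'religion' fixed; every other column is stacked into
# a pair of columns named 'variable' and 'value' (the defaults)
toy_long = toy.melt(id_vars='religion')
print(toy_long)
```

Each of the 2 rows is repeated once per melted column, so the long dataframe has 2 x 2 = 4 rows.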

MULTIPLE ID_VARS¶

To use more than one id_vars (i.e., to keep several variables as they are, without melting them), we pass a Python list containing the names of the variables. Python lists are enclosed in square brackets, and the items are separated by commas.

# Import the billboard.csv file
billboard = pd.read_csv('billboard.csv')

billboard.head()

# Melt the week variables and keep year, artist.inverted, track, time, genre, date.entered, date.peaked

reshaped = billboard.melt(['year','artist.inverted','track','time','genre','date.entered','date.peaked'])

reshaped.head()
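The list form of id_vars can also be checked on a tiny dataframe. This is a sketch only; the column names artist, track, wk1 and wk2 are hypothetical stand-ins for the much wider billboard columns.

```python
import pandas as pd

# Hypothetical miniature of the billboard layout: two id columns, two week columns
toy = pd.DataFrame({
    'artist': ['A', 'B'],
    'track': ['Song 1', 'Song 2'],
    'wk1': [78, 15],
    'wk2': [63.0, 8.0],
})

# A list of names keeps both 'artist' and 'track' as identifier columns
toy_long = toy.melt(id_vars=['artist', 'track'])
print(toy_long)
```

Every column not named in the list is melted, which is why there is no need to spell out the week columns one by one.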

Question: What is the average of the value column for each artist in the reshaped data?¶

avg_by_artist = reshaped.groupby('artist.inverted')['value'].mean()

avg_by_artist.head()

artist.inverted
2 Pac           85.428571
2Ge+her         90.000000
3 Doors Down    37.602740
504 Boyz        56.222222
98°             37.650000
Name: value, dtype: float64
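The melted billboard data contains many NaN values (weeks in which a track was not on the chart). The sketch below, built on hypothetical toy data, illustrates that groupby(...).mean() skips NaN by default, so the averages are taken over the observed weeks only.

```python
import pandas as pd
import numpy as np

# Hypothetical long-format data with one missing value for artist 'A'
toy = pd.DataFrame({
    'artist': ['A', 'A', 'A', 'B', 'B'],
    'value': [10.0, 20.0, np.nan, 5.0, 7.0],
})

# mean() ignores NaN: A is averaged over 2 values, not 3
avg = toy.groupby('artist')['value'].mean()
print(avg)
```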

TIP: Breaking Code on Multiple Lines¶

We can break code across multiple lines by wrapping it in parentheses. Here is the above code again, wrapped in parentheses and spread over three lines:

avg_by_artist = (reshaped
                 .groupby('artist.inverted')['value']
                 .mean())
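The same parenthesised layout scales to longer chains, with one method call per line. The following sketch assumes a hypothetical toy dataframe and chains melt, dropna and groupby in a single expression.

```python
import pandas as pd
import numpy as np

# Hypothetical wide data: one row per artist, one column per week
toy = pd.DataFrame({
    'artist': ['A', 'B'],
    'wk1': [10.0, 5.0],
    'wk2': [20.0, np.nan],
})

avg_by_artist = (
    toy
    .melt(id_vars='artist')        # wide -> long
    .dropna(subset=['value'])      # drop weeks with no value
    .groupby('artist')['value']
    .mean()
)
print(avg_by_artist)
```

Comments are allowed inside the parentheses, which makes it easy to annotate each step.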

MULTIPLE VARIABLES STORED IN ONE COLUMN¶

# load the ebola.csv file
ebola = pd.read_csv('ebola.csv')

ebola.head()

# Use Date and Day as id_vars and melt the rest.
ebola_melt = ebola.melt(id_vars=['Date', 'Day'])

ebola_melt.head()

Rename the newly created variables as case_country and deaths¶

ebola_melt = ebola.melt(
    id_vars=['Date', 'Day'],
    var_name='case_country',
    value_name='deaths')

ebola_melt.head()
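The effect of var_name and value_name can be seen on a two-row excerpt. This is a sketch under assumptions: the toy dataframe keeps only the Guinea columns, and the value column is named count here (a hypothetical choice) because the melted column holds both case and death counts.

```python
import pandas as pd

# Two rows and two measure columns taken from the ebola layout
toy = pd.DataFrame({
    'Date': ['1/5/2015', '1/4/2015'],
    'Day': [289, 288],
    'Cases_Guinea': [2776.0, 2775.0],
    'Deaths_Guinea': [1786.0, 1781.0],
})

# var_name replaces the default 'variable', value_name replaces the default 'value'
toy_long = toy.melt(
    id_vars=['Date', 'Day'],
    var_name='case_country',
    value_name='count')
print(toy_long)
```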

SPLIT STRING VARIABLE¶

Pandas offers a split method among the str accessor methods. It is documented in the pandas documentation under API reference > Series > Accessors > String handling: https://pandas.pydata.org/pandas-docs/version/0.25/reference/series.html#string-handling

split = ebola_melt['case_country'].str.split('_', expand=True)

split.head()

Explanation¶

ebola_melt['case_country'] : specifies the column to be split
.str.split() : uses the split method from the str group of methods
'_' : tells split which character separates the pieces of text
expand=True : specifies that the split pieces should be returned as separate columns of a new dataframe
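To see what expand=True changes, the two forms can be compared side by side. A minimal sketch on a hypothetical three-element Series:

```python
import pandas as pd

s = pd.Series(['Cases_Guinea', 'Deaths_Liberia', 'Cases_SierraLeone'])

# Without expand: a Series whose elements are Python lists
as_lists = s.str.split('_')

# With expand=True: a DataFrame with one column per piece, labelled 0 and 1
as_columns = s.str.split('_', expand=True)

print(as_lists)
print(as_columns)
```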

CREATE COLUMNS IN DATAFRAME¶

Columns can be added to a dataframe using constants, expressions, or by appending existing dataframes. In the following examples, we shall first add a column with a constant value of 100; let us call this new column test100.

# Example 1: Add a constant variable
ebola_melt['test100'] = 100

ebola_melt.head()

# Example 2: Add a variable using an expression. Divide deaths by 2
ebola_melt['death2'] = ebola_melt['deaths'] / 2

ebola_melt.head()

# Example 3: Append the split dataframe to ebola_melt dataframe
ebola_melt[['cases', 'country']] = split

# Why did we use two square brackets in the above code?
ebola_melt.head()
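One way to read the double brackets: the inner pair builds a Python list of column names, and the outer pair is the ordinary dataframe indexing operator. The sketch below uses a hypothetical two-row dataframe and the illustrative names status and country.

```python
import pandas as pd

df = pd.DataFrame({'case_country': ['Cases_Guinea', 'Deaths_Mali']})
parts = df['case_country'].str.split('_', expand=True)

# Inner brackets: a list with the two new column names.
# Outer brackets: dataframe indexing, here used for assignment.
# The columns of `parts` are matched to the names by position.
df[['status', 'country']] = parts
print(df)
```

With a single pair of brackets, df['status', 'country'] would be read as one key (a tuple) rather than two column names.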

Delete a Column?¶

Say we want to drop the column test100. Note that drop() returns a new dataframe and leaves ebola_melt unchanged.

testdrop = ebola_melt.drop('test100', axis='columns')

testdrop.head()
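That drop() does not modify the original dataframe can be verified directly. A small sketch on hypothetical data, using the columns= keyword, which is an equivalent spelling of axis='columns':

```python
import pandas as pd

df = pd.DataFrame({'deaths': [1786.0, 1781.0], 'test100': [100, 100]})

# drop() returns a new dataframe without the column
dropped = df.drop(columns='test100')

print(dropped.columns.tolist())  # the column is gone in the result
print(df.columns.tolist())       # the original still has it
```

To actually remove the column from df, assign the result back: df = df.drop(columns='test100').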

RESHAPE FROM LONG TO WIDE¶

This case is also described as variables being stored in both rows and columns.
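The text does not show the reshaping step itself, so the following is only a sketch of one common approach, assuming the pandas pivot() method is the intended tool. The toy dataframe and the column names status and count are hypothetical; the numbers are the Guinea figures from the ebola data.

```python
import pandas as pd

# Long format: the 'status' column holds variable names (Cases, Deaths)
long_df = pd.DataFrame({
    'Date': ['1/5/2015', '1/5/2015', '1/4/2015', '1/4/2015'],
    'status': ['Cases', 'Deaths', 'Cases', 'Deaths'],
    'count': [2776.0, 1786.0, 2775.0, 1781.0],
})

# Each distinct value of 'status' becomes its own column
wide = (long_df
        .pivot(index='Date', columns='status', values='count')
        .reset_index())
print(wide)
```

pivot() requires each index/columns combination to appear only once; when duplicates exist, pivot_table() with an aggregation function is the usual alternative.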

Output of pew.head():
	religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k	Don't know/refused
0	Agnostic	27	34	60	81	76	137	122	109	84	96
1	Atheist	12	27	37	52	35	70	73	59	74	76
2	Buddhist	27	21	30	34	33	58	62	39	53	54
3	Catholic	418	617	732	670	638	1116	949	792	633	1489
4	Don’t know/refused	15	14	15	11	10	35	21	17	18	116

Output of reshaped.head() after melting pew:
	religion	variable	value
0	Agnostic	<$10k	27
1	Atheist	<$10k	12
2	Buddhist	<$10k	27
3	Catholic	<$10k	418
4	Don’t know/refused	<$10k	15

Output of billboard.head():
	year	artist.inverted	track	time	genre	date.entered	date.peaked	x1st.week	x2nd.week	x3rd.week	...	x67th.week	x68th.week	x69th.week	x70th.week	x71st.week	x72nd.week	x73rd.week	x74th.week	x75th.week	x76th.week
0	2000	Destiny's Child	Independent Women Part I	3:38	Rock	2000-09-23	2000-11-18	78	63.0	49.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2000	Santana	Maria, Maria	4:18	Rock	2000-02-12	2000-04-08	15	8.0	6.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	2000	Savage Garden	I Knew I Loved You	4:07	Rock	1999-10-23	2000-01-29	71	48.0	43.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	2000	Madonna	Music	3:45	Rock	2000-08-12	2000-09-16	41	23.0	18.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	2000	Aguilera, Christina	Come On Over Baby (All I Want Is You)	3:38	Rock	2000-08-05	2000-10-14	57	47.0	45.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Output of reshaped.head() after melting billboard:
	year	artist.inverted	track	time	genre	date.entered	date.peaked	variable	value
0	2000	Destiny's Child	Independent Women Part I	3:38	Rock	2000-09-23	2000-11-18	x1st.week	78.0
1	2000	Santana	Maria, Maria	4:18	Rock	2000-02-12	2000-04-08	x1st.week	15.0
2	2000	Savage Garden	I Knew I Loved You	4:07	Rock	1999-10-23	2000-01-29	x1st.week	71.0
3	2000	Madonna	Music	3:45	Rock	2000-08-12	2000-09-16	x1st.week	41.0
4	2000	Aguilera, Christina	Come On Over Baby (All I Want Is You)	3:38	Rock	2000-08-05	2000-10-14	x1st.week	57.0

Output of ebola.head():
	Date	Day	Cases_Guinea	Cases_Liberia	Cases_SierraLeone	Cases_Nigeria	Cases_Senegal	Cases_UnitedStates	Cases_Spain	Cases_Mali	Deaths_Guinea	Deaths_Liberia	Deaths_SierraLeone	Deaths_Nigeria	Deaths_Senegal	Deaths_UnitedStates	Deaths_Spain	Deaths_Mali
0	1/5/2015	289	2776.0	NaN	10030.0	NaN	NaN	NaN	NaN	NaN	1786.0	NaN	2977.0	NaN	NaN	NaN	NaN	NaN
1	1/4/2015	288	2775.0	NaN	9780.0	NaN	NaN	NaN	NaN	NaN	1781.0	NaN	2943.0	NaN	NaN	NaN	NaN	NaN
2	1/3/2015	287	2769.0	8166.0	9722.0	NaN	NaN	NaN	NaN	NaN	1767.0	3496.0	2915.0	NaN	NaN	NaN	NaN	NaN
3	1/2/2015	286	NaN	8157.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3496.0	NaN	NaN	NaN	NaN	NaN	NaN
4	12/31/2014	284	2730.0	8115.0	9633.0	NaN	NaN	NaN	NaN	NaN	1739.0	3471.0	2827.0	NaN	NaN	NaN	NaN	NaN

Output of split.head():
	0	1
0	Cases	Guinea
1	Cases	Guinea
2	Cases	Guinea
3	Cases	Guinea
4	Cases	Guinea

Python : Tidy Data | Data Reshape | Long vs Wide Formats | Dataframes