Data Cleaning¶

import pandas as pd

data = pd.read_csv('https://opendoors.pk/wp-content/uploads/2020/02/prices.csv')

data.head()

data.info()


RangeIndex: 567 entries, 0 to 566
Data columns (total 7 columns):
Symbol    567 non-null object
Date      567 non-null object
Open      567 non-null object
High      567 non-null object
Low       567 non-null object
Close     567 non-null object
Volume    567 non-null object
dtypes: object(7)
memory usage: 31.1+ KB

So it is an object, convert to numeric¶

# slice the Close column to make a seperarte series
close = data['Close']

# See the first value of close
close[1]

'\xa0 193.00'

Remove the undesirable string \xa0¶

1. The Python list way¶

# First convert the series to a python list
plist = close.tolist()

# How many elements are there of plist
N = len(plist)
N

567

Manually remove the unwanted characters from the first 5 elements of plist¶

plist[0] = plist[0].replace('\xa0 ', '')

plist[1] = plist[1].replace('\xa0 ', '')
plist[2] = plist[2].replace('\xa0 ', '')
plist[3] = plist[3].replace('\xa0 ', '')
plist[4] = plist[4].replace('\xa0 ', '')

We can use a loop¶

for i in range(N) :
    plist[i] = plist[i].replace('\xa0 ', '')
    
    # Convert string to float
    plist[i] = float(plist[i])

# List just 11 elements to see whether the values are proper float
plist[1:10]

[193.0, 194.0, 185.1, 183.0, 180.55, 180.55, 189.5, 187.0, 189.91]

Get back the list to pandas series¶

close = pd.Series(plist)

close.head()

0    195.0
1    193.0
2    194.0
3    185.1
4    183.0
dtype: float64

2. Pandas Way¶

The above fragements of code is a goog python exercise. However, there a better and easy strategy using the pandas' built-in str.split() method

close2 = data['Close']

splitted = close2.str.split('\xa0 ', expand = True)

splitted.info()


RangeIndex: 567 entries, 0 to 566
Data columns (total 2 columns):
0    567 non-null object
1    567 non-null object
dtypes: object(2)
memory usage: 8.9+ KB

# drop column 0 as it has no data
splitted.drop([0], axis = 'columns', inplace = True)

splitted.head()

splitted.info()


RangeIndex: 567 entries, 0 to 566
Data columns (total 1 columns):
1    567 non-null object
dtypes: object(1)
memory usage: 4.5+ KB

# Convert the series from string object to float
splitted = splitted.astype(float)

splitted.describe()

Data Cleaning¶

import pandas as pd

data = pd.read_csv('https://opendoors.pk/wp-content/uploads/2020/02/prices.csv')

data.head()

data.info()


RangeIndex: 567 entries, 0 to 566
Data columns (total 7 columns):
Symbol    567 non-null object
Date      567 non-null object
Open      567 non-null object
High      567 non-null object
Low       567 non-null object
Close     567 non-null object
Volume    567 non-null object
dtypes: object(7)
memory usage: 31.1+ KB

So it is an object, convert to numeric¶

# slice the Close column to make a seperarte series
close = data['Close']

# See the first value of close
close[1]

'\xa0 193.00'

Remove the undesirable string \xa0¶

1. The Python list way¶

# First convert the series to a python list
plist = close.tolist()

# How many elements are there of plist
N = len(plist)
N

567

Manually remove the unwanted characters from the first 5 elements of plist¶

plist[0] = plist[0].replace('\xa0 ', '')

plist[1] = plist[1].replace('\xa0 ', '')
plist[2] = plist[2].replace('\xa0 ', '')
plist[3] = plist[3].replace('\xa0 ', '')
plist[4] = plist[4].replace('\xa0 ', '')

We can use a loop¶

for i in range(N) :
    plist[i] = plist[i].replace('\xa0 ', '')
    
    # Convert string to float
    plist[i] = float(plist[i])

# List just 11 elements to see whether the values are proper float
plist[1:10]

[193.0, 194.0, 185.1, 183.0, 180.55, 180.55, 189.5, 187.0, 189.91]

Get back the list to pandas series¶

close = pd.Series(plist)

close.head()

0    195.0
1    193.0
2    194.0
3    185.1
4    183.0
dtype: float64

2. Pandas Way¶

The above fragements of code is a goog python exercise. However, there a better and easy strategy using the pandas' built-in str.split() method

close2 = data['Close']

splitted = close2.str.split('\xa0 ', expand = True)

splitted.info()


RangeIndex: 567 entries, 0 to 566
Data columns (total 2 columns):
0    567 non-null object
1    567 non-null object
dtypes: object(2)
memory usage: 8.9+ KB

# drop column 0 as it has no data
splitted.drop([0], axis = 'columns', inplace = True)

splitted.head()

splitted.info()


RangeIndex: 567 entries, 0 to 566
Data columns (total 1 columns):
1    567 non-null object
dtypes: object(1)
memory usage: 4.5+ KB

# Convert the series from string object to float
splitted = splitted.astype(float)

splitted.describe()

	Symbol	Date	Open	High	Low	Close	Volume
0	AABS	2019-09-17	196.50	197.00	195.00	195.00	1800
1	AABS	2019-09-16	192.00	193.00	192.00	193.00	600
2	AABS	2019-09-13	180.00	194.33	180.00	194.00	2800
3	AABS	2019-09-12	185.00	185.19	185.00	185.10	200
4	AABS	2019-09-05	183.00	183.00	183.00	183.00	1000

	1
0	195.00
1	193.00
2	194.00
3	185.10
4	183.00

	1
count	567.000000
mean	203.358272
std	39.738085
min	126.470000
25%	185.375000
50%	194.350000
75%	209.995000
max	331.000000

	Symbol	Date	Open	High	Low	Close	Volume
0	AABS	2019-09-17	196.50	197.00	195.00	195.00	1800
1	AABS	2019-09-16	192.00	193.00	192.00	193.00	600
2	AABS	2019-09-13	180.00	194.33	180.00	194.00	2800
3	AABS	2019-09-12	185.00	185.19	185.00	185.10	200
4	AABS	2019-09-05	183.00	183.00	183.00	183.00	1000

	1
0	195.00
1	193.00
2	194.00
3	185.10
4	183.00

Data Cleaning in Pandas | Python

Data Cleaning¶

So it is an object, convert to numeric¶

Remove the undesirable string \xa0¶

1. The Python list way¶

Manually remove the unwanted characters from the first 5 elements of plist¶

We can use a loop¶

Get back the list to pandas series¶

2. Pandas Way¶

Data Cleaning¶

So it is an object, convert to numeric¶

Remove the undesirable string \xa0¶

1. The Python list way¶

Manually remove the unwanted characters from the first 5 elements of plist¶

We can use a loop¶

Get back the list to pandas series¶

2. Pandas Way¶