Data Cleaning

In [18]:
import pandas as pd
In [19]:
data = pd.read_csv('https://opendoors.pk/wp-content/uploads/2020/02/prices.csv')
In [20]:
data.head()
Out[20]:
Symbol Date Open High Low Close Volume
0 AABS 2019-09-17 196.50 197.00 195.00 195.00 1800
1 AABS 2019-09-16 192.00 193.00 192.00 193.00 600
2 AABS 2019-09-13 180.00 194.33 180.00 194.00 2800
3 AABS 2019-09-12 185.00 185.19 185.00 185.10 200
4 AABS 2019-09-05 183.00 183.00 183.00 183.00 1000
In [21]:
data.info()

RangeIndex: 567 entries, 0 to 566
Data columns (total 7 columns):
Symbol    567 non-null object
Date      567 non-null object
Open      567 non-null object
High      567 non-null object
Low       567 non-null object
Close     567 non-null object
Volume    567 non-null object
dtypes: object(7)
memory usage: 31.1+ KB

So it is an object, convert to numeric

In [22]:
# slice the Close column to make a seperarte series
close = data['Close']
In [23]:
# See the first value of close
close[1]
Out[23]:
'\xa0 193.00'

Remove the undesirable string \xa0

1. The Python list way

In [24]:
# First convert the series to a python list
plist = close.tolist()
In [25]:
# How many elements are there of plist
N = len(plist)
N
Out[25]:
567

Manually remove the unwanted characters from the first 5 elements of plist

In [26]:
plist[0] = plist[0].replace('\xa0 ', '')
In [27]:
plist[1] = plist[1].replace('\xa0 ', '')
plist[2] = plist[2].replace('\xa0 ', '')
plist[3] = plist[3].replace('\xa0 ', '')
plist[4] = plist[4].replace('\xa0 ', '')

We can use a loop

In [28]:
for i in range(N) :
    plist[i] = plist[i].replace('\xa0 ', '')
    
    # Convert string to float
    plist[i] = float(plist[i])
In [30]:
# List just 11 elements to see whether the values are proper float
plist[1:10]
Out[30]:
[193.0, 194.0, 185.1, 183.0, 180.55, 180.55, 189.5, 187.0, 189.91]

Get back the list to pandas series

In [31]:
close = pd.Series(plist)
In [34]:
close.head()
Out[34]:
0    195.0
1    193.0
2    194.0
3    185.1
4    183.0
dtype: float64

2. Pandas Way

The above fragements of code is a goog python exercise. However, there a better and easy strategy using the pandas' built-in str.split() method

In [35]:
close2 = data['Close']
In [81]:
splitted = close2.str.split('\xa0 ', expand = True)
In [84]:
splitted.info()

RangeIndex: 567 entries, 0 to 566
Data columns (total 2 columns):
0    567 non-null object
1    567 non-null object
dtypes: object(2)
memory usage: 8.9+ KB
In [92]:
# drop column 0 as it has no data
splitted.drop([0], axis = 'columns', inplace = True)
In [91]:
splitted.head()
Out[91]:
1
0 195.00
1 193.00
2 194.00
3 185.10
4 183.00
In [93]:
splitted.info()

RangeIndex: 567 entries, 0 to 566
Data columns (total 1 columns):
1    567 non-null object
dtypes: object(1)
memory usage: 4.5+ KB
In [96]:
# Convert the series from string object to float
splitted = splitted.astype(float)
In [98]:
splitted.describe()
Out[98]:
1
count 567.000000
mean 203.358272
std 39.738085
min 126.470000
25% 185.375000
50% 194.350000
75% 209.995000
max 331.000000
In [ ]:
 

Data Cleaning

In [18]:
import pandas as pd
In [19]:
data = pd.read_csv('https://opendoors.pk/wp-content/uploads/2020/02/prices.csv')
In [20]:
data.head()
Out[20]:
Symbol Date Open High Low Close Volume
0 AABS 2019-09-17 196.50 197.00 195.00 195.00 1800
1 AABS 2019-09-16 192.00 193.00 192.00 193.00 600
2 AABS 2019-09-13 180.00 194.33 180.00 194.00 2800
3 AABS 2019-09-12 185.00 185.19 185.00 185.10 200
4 AABS 2019-09-05 183.00 183.00 183.00 183.00 1000
In [21]:
data.info()

RangeIndex: 567 entries, 0 to 566
Data columns (total 7 columns):
Symbol    567 non-null object
Date      567 non-null object
Open      567 non-null object
High      567 non-null object
Low       567 non-null object
Close     567 non-null object
Volume    567 non-null object
dtypes: object(7)
memory usage: 31.1+ KB

So it is an object, convert to numeric

In [22]:
# slice the Close column to make a seperarte series
close = data['Close']
In [23]:
# See the first value of close
close[1]
Out[23]:
'\xa0 193.00'

Remove the undesirable string \xa0

1. The Python list way

In [24]:
# First convert the series to a python list
plist = close.tolist()
In [25]:
# How many elements are there of plist
N = len(plist)
N
Out[25]:
567

Manually remove the unwanted characters from the first 5 elements of plist

In [26]:
plist[0] = plist[0].replace('\xa0 ', '')
In [27]:
plist[1] = plist[1].replace('\xa0 ', '')
plist[2] = plist[2].replace('\xa0 ', '')
plist[3] = plist[3].replace('\xa0 ', '')
plist[4] = plist[4].replace('\xa0 ', '')

We can use a loop

In [28]:
for i in range(N) :
    plist[i] = plist[i].replace('\xa0 ', '')
    
    # Convert string to float
    plist[i] = float(plist[i])
In [30]:
# List just 11 elements to see whether the values are proper float
plist[1:10]
Out[30]:
[193.0, 194.0, 185.1, 183.0, 180.55, 180.55, 189.5, 187.0, 189.91]

Get back the list to pandas series

In [31]:
close = pd.Series(plist)
In [34]:
close.head()
Out[34]:
0    195.0
1    193.0
2    194.0
3    185.1
4    183.0
dtype: float64

2. Pandas Way

The above fragements of code is a goog python exercise. However, there a better and easy strategy using the pandas' built-in str.split() method

In [35]:
close2 = data['Close']
In [81]:
splitted = close2.str.split('\xa0 ', expand = True)
In [84]:
splitted.info()

RangeIndex: 567 entries, 0 to 566
Data columns (total 2 columns):
0    567 non-null object
1    567 non-null object
dtypes: object(2)
memory usage: 8.9+ KB
In [92]:
# drop column 0 as it has no data
splitted.drop([0], axis = 'columns', inplace = True)
In [91]:
splitted.head()
Out[91]:
1
0 195.00
1 193.00
2 194.00
3 185.10
4 183.00
In [93]:
splitted.info()

RangeIndex: 567 entries, 0 to 566
Data columns (total 1 columns):
1    567 non-null object
dtypes: object(1)
memory usage: 4.5+ KB
In [96]:
# Convert the series from string object to float
splitted = splitted.astype(float)
In [98]:
splitted.describe()
Out[98]:
1
count 567.000000
mean 203.358272
std 39.738085
min 126.470000
25% 185.375000
50% 194.350000
75% 209.995000
max 331.000000
In [ ]: