0%

Pandas笔记(紧急ver.)

Lab8 Pandas_1

Exam1

1. Create a 5-D random numpy list var_list

1
2
import numpy as np
var_list = np.random.randn(5)

2. use “uuid” to generate 5 random keys (use str(uuid.uuid4())), and store them into a list key_list

1
2
import uuid
key_list = [str(uuid.uuid4())[:6] for i in range(5)]

3. Create a dictionary dict from var_list and key_list

1
dict_tmp = {key_list[i]:var_list[i] for i in range(5)}

4. Create a Pandas Series from a) var_list b) var_list and key_list c) dict

1
2
3
4
import pandas as pd
pd_series_var = pd.Series(var_list)
pd_series_var_key = pd.Series(var_list,key_list)
pd_series_dict = pd.Series(dict_tmp)

5. Convert the Series back to the list and dictionary

1
2
var_list_new = pd_series_var_key.to_list()
dict_new = pd_series_dict.to_dict()

6. Find out the elements larger than zero

1
pd_series_positive = pd.series_dict[pd_series_dict > 0]

7. Calculate the proportion of positive elements in the Series

1
proportion = len(pd_series_postive)/len(pd_series_dict)

8. Write down as many ways of forming a list that contains the values of Series elements

1
2
3
4
5
6
7
8
val_1 = pd_series_dict.to_list()
val_2 = []

for idx, ival in pd_series_dict.iteritems():
val_2.append(ival)
val_3 = pd_series_dict.values
val_tmp = pd_series_dict.index
val_4 = [pd_series_dict[ikey] for ikey in pd_series_dict.index]

9. Calculate the proportion of elements that are larger than the mean value of the Series

1
2
3
mean_val = np.mean(pd_series_dict)
pd_series_larger_than_mean = pd_series_dict[pd_series_dict > mean_val]
proportion_2 = len(pd_series_larger_than_mean)/len(pd_series_dict)

Exam2

1. Write codes to create a random dict x which has 5 random keys and each key corresponds to a 6-D numpy array

1
2
3
4
5
6
7
import uuid
data = {}
key_list = []
for i in range(5):
rand_key = str(uuid.uuid4())[:6]
key_list.append(rand_key)
data[rand_key] = np.random.randn(6)

2. Create a pandas dataframe using x

1
2
df = pd.DataFrame(data)
print(df)

3. Create a pandas dataframe using a subset of x, in the subset of x, only keys that start with a digit are chosen

1
2
3
4
sub_key_list = [i_key for i_key in key_list if i_key[0] in '0123456789']
print(sub_key_list)
df_sub = pd.DataFrame(data,columns = sub_key_list)
print(df_sub)

4. Create a new pandas dataframe using the codes in the previous slide

1
2
3
dates = pd.data_range('1/1/2000',periods = 8)
df = pd.DataFrame(np.random.randn(8,4),index = dates, columns = ['A','B','C','D'])
print(df)

5. Select rows whose attribute A is smaller than the mean of attribute C

1
2
3
4
5
df_c = df['C']
print(df_c)
mean_c = np.mean(df_c)
print(mean_c)
print(df[df['A']<mean_c])

6. Can you select the column B and C using [] indexing? Try it out and see what happens

1
2
df_bc = df[['B','C']]
print(df_bc)

Exam3

convert a Panda module Series to Python list

1
2
3
4
5
6
7
8
9
10
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])
print("Pandas Series and type")
print(series)
print(type(series))

print("Convert Pandas Series to Python list")
print(series.tolist())
print(type(series.tolist()))
1
2
3
4
5
6
7
8
9
10
11
Pandas Series and type
0 1
1 2
2 3
3 4
4 5
dtype: int64
<class 'pandas.core.series.Series'>
Convert Pandas Series to Python list
[1, 2, 3, 4, 5]
<class 'list'>

convert a dictionary to a Pandas series

1
2
3
4
5
6
7
8
9
import pandas as pd

dict = {'a': 100, 'b': 200, 'c':300, 'd':400, 'e':500}
print("Original dictionary:")
print(dict)

new_series = pd.Series(dict)
print("Converted series:")
print(new_series)
1
2
3
4
5
6
7
8
9
Original dictionary:
{'a': 100, 'b': 200, 'c': 300, 'd': 400, 'e': 500}
Converted series:
a 100
b 200
c 300
d 400
e 500
dtype: int64

convert a NumPy array to a Pandas series

1
2
3
4
5
6
7
8
9
10
import numpy as np
import pandas as pd

np_array = np.array([1, 2, 3, 4, 5])
print("NumPy array:")
print(np_array)

new_series = pd.Series(np_array)
print("Converted Pandas series:")
print(new_series)
1
2
3
4
5
6
7
8
9
NumPy array:
[1 2 3 4 5]
Converted Pandas series:
0 1
1 2
2 3
3 4
4 5
dtype: int32

convert the column of a DataFrame as a Series

1
2
3
4
5
6
7
8
9
10
11
12
13
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7, 11], 'col2': [4, 5, 6, 9, 5, 0], 'col3': [7, 5, 8, 12, 1,11]}
df = pd.DataFrame(data=d)
print(type(df))

print("Original DataFrame")
print(df)
s1 = df.iloc[:,0]

print("\n1st column as a Series:")
print(s1)
print(type(s1))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
<class 'pandas.core.frame.DataFrame'>
Original DataFrame
col1 col2 col3
0 1 4 7
1 2 5 5
2 3 6 8
3 4 9 12
4 7 5 1
5 11 0 11

1st column as a Series:
0 1
1 2
2 3
3 4
4 7
5 11
Name: col1, dtype: int64
<class 'pandas.core.series.Series'>

Create a subset of a series

1
2
3
4
5
6
7
8
9
10
mport pandas as pd

s = pd.Series([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
print("Original Data Series:")
print(s)

print("\nSubset of the above Data Series:")
n = 10
new_s = s[s > n]
print(new_s)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Original Data Series:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
dtype: int64

Subset of the above Data Series:
11 11
12 12
13 13
14 14
15 15
dtype: int64

Lab9 Pandas_2

1
2
import numpy as np
import pandas as pd

1. Write codes to create two DataFrames df_left, df_right, with the columns as “[key, lval1, lval2]” and “[key, rval1, rval2]”, and the values are “[a,b,c]”, and “[b,c,d]” respectively. Generate random numbers with normal distribution to for the “lval” and “rval” elements

1
2
3
4
5
6
split_str = '_______'
left_df = pd.DataFrame({'key':['a','b','c'],'lval1':np.random.randn(3),'lval2':np.random.randn(3)})
right_df = pd.DataFrame({'key':list('bcd'),'rval1':np.random.randn(3),'rval2':np.randoom.randn(3)})
print(left_df)
print(split_str)
print(right_df)
1
2
3
4
5
6
7
8
9
  key     lval1     lval2
0 a -0.306740 0.370246
1 b -1.633727 -0.351369
2 c 1.558975 -0.179692
---------
key rval1 rval2
0 b -0.036699 0.724182
1 c -1.241680 -1.695795
2 d 1.580775 -1.271330

2. Compute the left outer join of df_left and df_right, check out the results

1
2
left_merge = pd.merge(left_df, right_df, how = 'left')
print(left_merge)
1
2
3
4
  key     lval1     lval2     rval1     rval2
0 a -0.306740 0.370246 NaN NaN
1 b -1.633727 -0.351369 -0.036699 0.724182
2 c 1.558975 -0.179692 -1.241680 -1.695795

3. Change the name “key” of df_left to “key_left”, re-run step 2 and see what happens

1
left_df.columns = ['key_left','lval1','lval2']

4. Compute the right outer join of df_left and df_right in step 2, check out the results

1
2
right_merge = pd.merge(left_df, right_df, how = 'right')
print(right_merge)
1
2
3
4
  key     lval1     lval2     rval1     rval2
0 b -1.633727 -0.351369 -0.036699 0.724182
1 c 1.558975 -0.179692 -1.241680 -1.695795
2 d NaN NaN 1.580775 -1.271330

5. Compute the full outer join of df_left and df_right in step 2, check out the results

1
2
outer_merge = pd.merge(left_df, right_df, how = 'outer')
print(outer_merge)
1
2
3
4
5
  key     lval1     lval2     rval1     rval2
0 a -0.306740 0.370246 NaN NaN
1 b -1.633727 -0.351369 -0.036699 0.724182
2 c 1.558975 -0.179692 -1.241680 -1.695795
3 d NaN NaN 1.580775 -1.271330

6. Compute the inner join of df_left and df_right in step 2, check out the results

1
2
inner_merge = pd.merge(left_df, right_df, how = 'inner')
print(inner_merge)
1
2
3
  key     lval1     lval2     rval1     rval2
0 b -1.633727 -0.351369 -0.036699 0.724182
1 c 1.558975 -0.179692 -1.241680 -1.695795

7. Get the floating value columns of df_left (lval1,lval2), get the square root of the absolute values using apply

1
2
3
4
5
6
7
8
9
print(left_df)
left_df_val = left_df[['lval1','lval2']]
print(left_df_val)

left_df_val_abs = left_df_val.apply(np.abs)
print(left_df_val_abs)

left_df_val_abs_sqrt = left_df_val_abs.apply(np.sqrt)
print(left_df_val_abs_sqrt)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
  key     lval1     lval2
0 a -0.306740 0.370246
1 b -1.633727 -0.351369
2 c 1.558975 -0.179692
lval1 lval2
0 -0.306740 0.370246
1 -1.633727 -0.351369
2 1.558975 -0.179692
lval1 lval2
0 0.306740 0.370246
1 1.633727 0.351369
2 1.558975 0.179692
lval1 lval2
0 0.553841 0.608478
1 1.278173 0.592764
2 1.248589 0.423901

8. Try using numpy to directly calculate the above operations on df_left

1
2
left_df_val_abs_sqrt_np = np.sqrt(np.abs(left_df_val))
print(left_df_val_abs_sqrt_np)
1
2
3
4
      lval1     lval2
0 0.553841 0.608478
1 1.278173 0.592764
2 1.248589 0.423901

9. Write the apply_map functions to accomplish step 7

1
2
3
4
5
6
7
print(left_df)
left_df_val = left_df[['lval1','lval2']]
print(left_df_val)
left_df_val_abs = left_df_val.applymap(np.abs)
print(left_df_val_abs)
left_df_val_abs_sqrt = left_df_val_abs.applymap(np.sqrt)
print(left_df_val_abs_sqrt)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
  key     lval1     lval2
0 a -0.306740 0.370246
1 b -1.633727 -0.351369
2 c 1.558975 -0.179692
lval1 lval2
0 -0.306740 0.370246
1 -1.633727 -0.351369
2 1.558975 -0.179692
lval1 lval2
0 0.306740 0.370246
1 1.633727 0.351369
2 1.558975 0.179692
lval1 lval2
0 0.553841 0.608478
1 1.278173 0.592764
2 1.248589 0.423901

10. Get the data of “Countries and dependencies by area” from wiki and save to the excel excluding index

1
2
3
4
5
6
7
8
9
import requests
url_wiki = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area'
r = requests.get(url_wiki,headers ={'User-Agent':'Mozilla/5.0 (Windows NT 10.0;Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124Safari/537.36'})

data = pd.read_html(r.text)
print([idata.shape for idata in data])
data_area = data[1]
print(data_area)
data_area.to_excel('area_info.xlsx',index = False)