Lab8 Pandas_1

Exam1

1. Create a 5-D random numpy list var_list

1 2	import numpy as np var_list = np.random.randn(5)

2. use “uuid” to generate 5 random keys (use str(uuid.uuid4())), and store them into a list key_list

1 2	import uuid key_list = [str(uuid.uuid4())[:6] for i in range(5)]

3. Create a dictionary dict from var_list and key_list

1	dict_tmp = {key_list[i]:var_list[i] for i in range(5)}

4. Create a Pandas Series from a) var_list b) var_list and key_list c) dict

import pandas as pd
pd_series_var = pd.Series(var_list)
pd_series_var_key = pd.Series(var_list,key_list)
pd_series_dict = pd.Series(dict_tmp)

5. Convert the Series back to the list and dictionary

1 2	var_list_new = pd_series_var_key.to_list() dict_new = pd_series_dict.to_dict()

6. Find out the elements larger than zero

1	pd_series_positive = pd.series_dict[pd_series_dict > 0]

7. Calculate the proportion of positive elements in the Series

1	proportion = len(pd_series_postive)/len(pd_series_dict)

8. Write down as many ways of forming a list that contains the values of Series elements

val_1 = pd_series_dict.to_list()
val_2 = []

for idx, ival in pd_series_dict.iteritems():
    val_2.append(ival)
val_3 = pd_series_dict.values
val_tmp = pd_series_dict.index
val_4 = [pd_series_dict[ikey] for ikey in pd_series_dict.index]

9. Calculate the proportion of elements that are larger than the mean value of the Series

1
2
3

mean_val = np.mean(pd_series_dict)
pd_series_larger_than_mean = pd_series_dict[pd_series_dict > mean_val]
proportion_2 = len(pd_series_larger_than_mean)/len(pd_series_dict)

Exam2

1. Write codes to create a random dict x which has 5 random keys and each key corresponds to a 6-D numpy array

import uuid
data = {}
key_list = []
for i in range(5):
    rand_key = str(uuid.uuid4())[:6]
    key_list.append(rand_key)
    data[rand_key] = np.random.randn(6)

2. Create a pandas dataframe using x

1 2	df = pd.DataFrame(data) print(df)

3. Create a pandas dataframe using a subset of x, in the subset of x, only keys that start with a digit are chosen

sub_key_list = [i_key for i_key in key_list if i_key[0] in '0123456789']
print(sub_key_list)
df_sub = pd.DataFrame(data,columns = sub_key_list)
print(df_sub)

4. Create a new pandas dataframe using the codes in the previous slide

1
2
3

dates = pd.data_range('1/1/2000',periods = 8)
df = pd.DataFrame(np.random.randn(8,4),index = dates, columns = ['A','B','C','D'])
print(df)

5. Select rows whose attribute A is smaller than the mean of attribute C

df_c = df['C']
print(df_c)
mean_c = np.mean(df_c)
print(mean_c)
print(df[df['A']<mean_c])

6. Can you select the column B and C using [] indexing? Try it out and see what happens

1 2	df_bc = df[['B','C']] print(df_bc)

Exam3

convert a Panda module Series to Python list

import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])
print("Pandas Series and type")
print(series)
print(type(series))

print("Convert Pandas Series to Python list")
print(series.tolist())
print(type(series.tolist()))

Pandas Series and type
0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'pandas.core.series.Series'>
Convert Pandas Series to Python list
[1, 2, 3, 4, 5]
<class 'list'>

convert a dictionary to a Pandas series

import pandas as pd

dict = {'a': 100, 'b': 200, 'c':300, 'd':400, 'e':500}
print("Original dictionary:")
print(dict)

new_series = pd.Series(dict)
print("Converted series:")
print(new_series)

Original dictionary:
{'a': 100, 'b': 200, 'c': 300, 'd': 400, 'e': 500}
Converted series:
a    100
b    200
c    300
d    400
e    500
dtype: int64

convert a NumPy array to a Pandas series

import numpy as np
import pandas as pd

np_array = np.array([1, 2, 3, 4, 5])
print("NumPy array:")
print(np_array)

new_series = pd.Series(np_array)
print("Converted Pandas series:")
print(new_series)

NumPy array:
[1 2 3 4 5]
Converted Pandas series:
0    1
1    2
2    3
3    4
4    5
dtype: int32

convert the column of a DataFrame as a Series

import pandas as pd

d = {'col1': [1, 2, 3, 4, 7, 11], 'col2': [4, 5, 6, 9, 5, 0], 'col3': [7, 5, 8, 12, 1,11]}
df = pd.DataFrame(data=d)
print(type(df))

print("Original DataFrame")
print(df)
s1 = df.iloc[:,0]

print("\n1st column as a Series:")
print(s1)
print(type(s1))

<class 'pandas.core.frame.DataFrame'>
Original DataFrame
   col1  col2  col3
0     1     4     7
1     2     5     5
2     3     6     8
3     4     9    12
4     7     5     1
5    11     0    11

1st column as a Series:
0     1
1     2
2     3
3     4
4     7
5    11
Name: col1, dtype: int64
<class 'pandas.core.series.Series'>

Create a subset of a series

mport pandas as pd

s = pd.Series([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
print("Original Data Series:")
print(s)

print("\nSubset of the above Data Series:")
n = 10
new_s = s[s > n]
print(new_s)

Original Data Series:
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
dtype: int64

Subset of the above Data Series:
11    11
12    12
13    13
14    14
15    15
dtype: int64

Lab9 Pandas_2

1 2	import numpy as np import pandas as pd

1. Write codes to create two DataFrames df_left, df_right, with the columns as “[key, lval1, lval2]” and “[key, rval1, rval2]”, and the values are “[a,b,c]”, and “[b,c,d]” respectively. Generate random numbers with normal distribution to for the “lval” and “rval” elements

split_str = '_______'
left_df = pd.DataFrame({'key':['a','b','c'],'lval1':np.random.randn(3),'lval2':np.random.randn(3)})
right_df = pd.DataFrame({'key':list('bcd'),'rval1':np.random.randn(3),'rval2':np.randoom.randn(3)})
print(left_df)
print(split_str)
print(right_df)

  key     lval1     lval2
0   a -0.306740  0.370246
1   b -1.633727 -0.351369
2   c  1.558975 -0.179692
---------
  key     rval1     rval2
0   b -0.036699  0.724182
1   c -1.241680 -1.695795
2   d  1.580775 -1.271330

2. Compute the left outer join of df_left and df_right, check out the results

1 2	left_merge = pd.merge(left_df, right_df, how = 'left') print(left_merge)

  key     lval1     lval2     rval1     rval2
0   a -0.306740  0.370246       NaN       NaN
1   b -1.633727 -0.351369 -0.036699  0.724182
2   c  1.558975 -0.179692 -1.241680 -1.695795

3. Change the name “key” of df_left to “key_left”, re-run step 2 and see what happens

1	left_df.columns = ['key_left','lval1','lval2']

4. Compute the right outer join of df_left and df_right in step 2, check out the results

1 2	right_merge = pd.merge(left_df, right_df, how = 'right') print(right_merge)

  key     lval1     lval2     rval1     rval2
0   b -1.633727 -0.351369 -0.036699  0.724182
1   c  1.558975 -0.179692 -1.241680 -1.695795
2   d       NaN       NaN  1.580775 -1.271330

5. Compute the full outer join of df_left and df_right in step 2, check out the results

1 2	outer_merge = pd.merge(left_df, right_df, how = 'outer') print(outer_merge)

  key     lval1     lval2     rval1     rval2
0   a -0.306740  0.370246       NaN       NaN
1   b -1.633727 -0.351369 -0.036699  0.724182
2   c  1.558975 -0.179692 -1.241680 -1.695795
3   d       NaN       NaN  1.580775 -1.271330

6. Compute the inner join of df_left and df_right in step 2, check out the results

1 2	inner_merge = pd.merge(left_df, right_df, how = 'inner') print(inner_merge)

1
2
3

  key     lval1     lval2     rval1     rval2
0   b -1.633727 -0.351369 -0.036699  0.724182
1   c  1.558975 -0.179692 -1.241680 -1.695795

7. Get the floating value columns of df_left (lval1,lval2), get the square root of the absolute values using apply

print(left_df)
left_df_val = left_df[['lval1','lval2']]
print(left_df_val)

left_df_val_abs = left_df_val.apply(np.abs)
print(left_df_val_abs)

left_df_val_abs_sqrt = left_df_val_abs.apply(np.sqrt)
print(left_df_val_abs_sqrt)

  key     lval1     lval2
0   a -0.306740  0.370246
1   b -1.633727 -0.351369
2   c  1.558975 -0.179692
      lval1     lval2
0 -0.306740  0.370246
1 -1.633727 -0.351369
2  1.558975 -0.179692
      lval1     lval2
0  0.306740  0.370246
1  1.633727  0.351369
2  1.558975  0.179692
      lval1     lval2
0  0.553841  0.608478
1  1.278173  0.592764
2  1.248589  0.423901

8. Try using numpy to directly calculate the above operations on df_left

1 2	left_df_val_abs_sqrt_np = np.sqrt(np.abs(left_df_val)) print(left_df_val_abs_sqrt_np)

      lval1     lval2
0  0.553841  0.608478
1  1.278173  0.592764
2  1.248589  0.423901

9. Write the apply_map functions to accomplish step 7

print(left_df)
left_df_val = left_df[['lval1','lval2']]
print(left_df_val)
left_df_val_abs = left_df_val.applymap(np.abs)
print(left_df_val_abs)
left_df_val_abs_sqrt = left_df_val_abs.applymap(np.sqrt)
print(left_df_val_abs_sqrt)

  key     lval1     lval2
0   a -0.306740  0.370246
1   b -1.633727 -0.351369
2   c  1.558975 -0.179692
      lval1     lval2
0 -0.306740  0.370246
1 -1.633727 -0.351369
2  1.558975 -0.179692
      lval1     lval2
0  0.306740  0.370246
1  1.633727  0.351369
2  1.558975  0.179692
      lval1     lval2
0  0.553841  0.608478
1  1.278173  0.592764
2  1.248589  0.423901

10. Get the data of “Countries and dependencies by area” from wiki and save to the excel excluding index

import requests
url_wiki = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area'
r = requests.get(url_wiki,headers ={'User-Agent':'Mozilla/5.0 (Windows NT 10.0;Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124Safari/537.36'})

data = pd.read_html(r.text)
print([idata.shape for idata in data])
data_area = data[1]
print(data_area)
data_area.to_excel('area_info.xlsx',index = False)

倬倬吃三碗

Pandas笔记(紧急ver.)

Lab8 Pandas_1

Exam1

Exam2

Exam3

Lab9 Pandas_2