
Heart Disease Analysis with Spark

I. Project Goals

Society moves quickly today, and the pace of life accelerates with the economy. Many young people, burdened by heavy workloads, a lack of exercise, and irregular diet and sleep, slip into sub-health. Symptoms such as back and leg pain or an upset stomach are easy to notice and treat, but heart disease is hard to catch: without regular checkups it is difficult to perceive latent cardiac risk, and cardiac events strike quickly with severe consequences. The Framingham Heart Study (FHS) has tracked heart disease for more than 70 years; one of its large-scale studies, spanning 26 years, found that sudden cardiac death accounted for as much as 75% of all sudden deaths. News of young people dying suddenly is, sadly, common. This project therefore analyzes and models heart disease data so that people better understand cardiac disease and pay more attention to their own health. The main goals:

1. Clean, process, and select features from heart disease datasets.
2. Build models on the processed data.
3. Visualize the data on a web page, with support for querying personal heart disease risk from user input.

I was mainly responsible for the data processing and cleaning; the other parts are only sketched here.

Project files: https://github.com/zhuozhuo233/Heart-disease-analysis

II. Data Sources and Field Descriptions

The datasets come from the UCI open dataset repository and from public datasets on Kaggle.

The raw datasets are described below:

Heart dataset

Field descriptions:

Attribute | Meaning | Notes
age | Age | 29-77
sex | Sex | 1 = male, 0 = female
cp | Chest pain type | 0 = asymptomatic, 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain
trestbps | Resting blood pressure | 94-200
chol | Cholesterol (mg/dl) | 126-564
fbs | Fasting blood sugar > 120 mg/dl | 0 = no, 1 = yes
restecg | Resting ECG result | 0 = normal, 1 = abnormal, 2 = left ventricular hypertrophy
thalach | Maximum heart rate | 71-202
exang | Exercise-induced angina | 0 = no, 1 = yes
oldpeak | ST depression induced by exercise relative to rest | 0-6.2
slope | Slope of the peak exercise ST segment | 0 = flat, 1 = upsloping, 2 = downsloping
ca | Number of major vessels colored by fluoroscopy |
thal | Thalassemia | 1 = normal, 2 = fixed defect, 3 = reversible defect
target | Has heart disease | 0 = no, 1 = yes
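When these fields are loaded into Spark, a typed record can make downstream code easier to check. A minimal plain-Scala sketch; the case class name, field types, and sample row below are my assumptions, not part of the repository:

```scala
// Hypothetical typed view of one Heart row, mirroring the field table above.
case class HeartRecord(
  age: Int, sex: Int, cp: Int, trestbps: Int, chol: Int, fbs: Int,
  restecg: Int, thalach: Int, exang: Int, oldpeak: Double, slope: Int,
  ca: Int, thal: Int, target: Int
)

// Example row (values are illustrative only).
val sample = HeartRecord(63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1)
```

With a schema like this, `spark.read.table("heart.heart").as[HeartRecord]` would give a typed `Dataset` instead of an untyped `DataFrame`.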

Cardio dataset

Attribute | Meaning | Notes
id | Record ID |
age | Age |
gender | Sex | 1 = female, 2 = male
height | Height (cm) |
weight | Weight (kg) |
ap_hi | Systolic blood pressure |
ap_lo | Diastolic blood pressure |
cholesterol | Cholesterol | 1 = normal, 2 = above normal, 3 = well above normal
gluc | Blood glucose | 1 = normal, 2 = above normal, 3 = well above normal
smoke | Smoking habit | 0 = no, 1 = yes
alco | Drinking habit | 0 = no, 1 = yes
active | Exercise habit | 0 = no, 1 = yes
cardio | Has cardiovascular disease | 0 = no, 1 = yes

III. Project Files

1. Data processing

Cardio project

Partition:

Writes each column of the table to its own JSON file for later use.

import org.apache.spark.sql.SparkSession

object partition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val cardio = spark.read.table("cardio.cardio")

    // Write each column to its own JSON directory under /user/root/cardio.
    // The label column "cardio" is written to "cardio1" so its path does not
    // clash with the parent directory name.
    val columns = Seq(
      "id" -> "id", "age" -> "age", "gender" -> "gender",
      "ap_hi" -> "ap_hi", "ap_lo" -> "ap_lo", "cholesterol" -> "cholesterol",
      "gluc" -> "gluc", "smoke" -> "smoke", "alco" -> "alco",
      "active" -> "active", "cardio" -> "cardio1"
    )
    for ((col, path) <- columns) {
      cardio.select(col).write.mode("overwrite").json(s"/user/root/cardio/$path")
    }
  }
}

Ap:

Computes the share of heart-disease patients whose systolic and diastolic pressure is normal, plus the distribution of patients' systolic and diastolic values.

import org.apache.spark.sql.SparkSession

object ap {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val cardio = spark.read.table("cardio.cardio")

    // All statistics below are restricted to heart-disease patients.
    val patients = cardio.filter("cardio = '1'").cache()
    val total = patients.count()

    val normalHi = patients.filter("ap_hi >= '90' and ap_hi <= '140'").count()
    println("Share of patients with normal systolic pressure:")
    rate((normalHi.toFloat / total).formatted("%.2f").toFloat)

    val normalLo = patients.filter("ap_lo >= '60' and ap_lo <= '90'").count()
    println("Share of patients with normal diastolic pressure:")
    rate((normalLo.toFloat / total).formatted("%.2f").toFloat)

    // Systolic distribution in roughly 10 mmHg bins.
    val hiBins = Seq((100, 110), (111, 120), (121, 130), (131, 140), (141, 150),
      (151, 160), (161, 170), (171, 180), (181, 190), (191, 200))
    for ((lo, hi) <- hiBins) {
      val n = patients.filter(s"ap_hi >= '$lo' and ap_hi <= '$hi'").count()
      println(s"Systolic pressure in $lo~$hi mmHg:")
      rate((n.toFloat / total).formatted("%.2f").toFloat)
    }

    // Diastolic distribution in roughly 10 mmHg bins.
    val loBins = Seq((60, 70), (71, 80), (81, 90), (91, 100), (101, 110), (111, 120))
    for ((lo, hi) <- loBins) {
      val n = patients.filter(s"ap_lo >= '$lo' and ap_lo <= '$hi'").count()
      println(s"Diastolic pressure in $lo~$hi mmHg:")
      rate((n.toFloat / total).formatted("%.2f").toFloat)
    }
  }

  def rate(x: Float): Unit = {
    val per = (x * 100).toInt
    println(per)
  }
}
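A note on the `rate`/`baifenshu` helpers used throughout these jobs: they round by formatting to two decimals and then truncating with `(x * 100).toInt`. A more direct approach is to round once, straight from the raw counts. A plain-Scala sketch; the `percent` name is hypothetical and not in the repository:

```scala
// Hypothetical alternative to the repo's rate()/baifenshu() helpers:
// rounds the percentage once from the raw counts instead of formatting
// to two decimals and then truncating with toInt.
def percent(part: Long, total: Long): Int =
  if (total == 0) 0 else math.round(part.toFloat / total * 100)

// percent(2, 3) == 67
```

The guard also avoids a division by zero when a group happens to be empty, which the original helpers do not handle.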

Bmi:

Extracts gender together with the corresponding BMI values and writes them to a JSON file for later use.

import org.apache.spark.sql.SparkSession

object bmi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val cardio = spark.read.table("cardio.cardio3")

    val bmi = cardio.select("gender", "bmi")
    bmi.write.mode("overwrite").json("/user/root/cardio/age_bmi")
  }
}
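The `cardio.cardio3` table already carries a precomputed `bmi` column; its derivation is not shown in the repository. A plain-Scala sketch of the standard formula, using the dataset's height (cm) and weight (kg) conventions; the `bmi` function here is my illustration, not the project's code:

```scala
// BMI = weight (kg) / height (m)^2, with height given in centimeters
// as in the Cardio dataset's height column.
def bmi(heightCm: Double, weightKg: Double): Double = {
  val heightM = heightCm / 100.0
  weightKg / (heightM * heightM)
}
```

In Spark SQL the same derivation could be written as a `withColumn` expression over the `height` and `weight` columns.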

ageprocess:

Starting from age: finds the minimum and maximum ages, then computes and prints the proportion of patients in each age band.

import org.apache.spark.sql.SparkSession

object ageprocess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val cardio = spark.read.table("cardio.cardio")

    cardio.selectExpr("max(age) as max_age").show()
    cardio.selectExpr("min(age) as min_age").show()

    val agetotal = cardio.select("age").count()

    // Share of all records that are patients within each age band.
    val bands = Seq((30, 35), (36, 40), (41, 45), (46, 50), (51, 55), (56, 60), (61, 65))
    for ((lo, hi) <- bands) {
      val n = cardio.filter(s"age >= '$lo' and age <= '$hi'").filter("cardio = '1'").count()
      println(s"Ratio for ages $lo to $hi:")
      rate((n.toFloat / agetotal).formatted("%.2f").toFloat)
    }
  }

  def rate(x: Float): Unit = {
    val per = (x * 100).toInt
    println(per)
  }
}

Exam:

Finds how heart-disease patients' cholesterol compares with normal values and prints the result.

Finds how heart-disease patients' blood glucose compares with normal values and prints the result.

import org.apache.spark.sql.SparkSession

object exam {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val cardio = spark.read.table("cardio.cardio")
    val patients = cardio.filter("cardio = '1'").cache()

    // Level 1 = normal, 2 = above normal, 3 = well above normal.
    val labels = Map(1 -> "normal", 2 -> "above normal", 3 -> "well above normal")

    val choTotal = patients.select("cholesterol").count()
    for (level <- 1 to 3) {
      val n = patients.filter(s"cholesterol = '$level'").count()
      println(s"Share of patients with cholesterol ${labels(level)}:")
      rate((n.toFloat / choTotal).formatted("%.2f").toFloat)
    }

    val glucTotal = patients.select("gluc").count()
    for (level <- 1 to 3) {
      val n = patients.filter(s"gluc = '$level'").count()
      println(s"Share of patients with blood glucose ${labels(level)}:")
      rate((n.toFloat / glucTotal).formatted("%.2f").toFloat)
    }
  }

  def rate(x: Float): Unit = {
    val per = (x * 100).toInt
    println(per)
  }
}

Hobbys:

Computes the share of heart-disease patients with smoking, drinking, and exercise habits.

import org.apache.spark.sql.SparkSession

object hobbys {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val cardio = spark.read.table("cardio.cardio")

    // All three habits share the same denominator: the number of patients.
    val patients = cardio.filter("cardio = '1'").cache()
    val total = patients.count()

    val habits = Seq("smoke" -> "smoking", "alco" -> "drinking", "active" -> "exercise")
    for ((col, name) <- habits) {
      val n = patients.filter(s"$col = '1'").count()
      println(s"Share of patients with a $name habit:")
      rate((n.toFloat / total).formatted("%.2f").toFloat)
    }
  }

  def rate(x: Float): Unit = {
    val per = (x * 100).toInt
    println(per)
  }
}

Ifcardio:

Extracts age and disease status from the dataset and writes them to a JSON file for later use.

import org.apache.spark.sql.SparkSession

object ifcardio {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val cardio = spark.read.table("cardio.cardio3")

    val cardio1 = cardio.select("age", "cardio")
    cardio1.write.mode("overwrite").json("/user/root/cardio/age_cardio")
  }
}

Heart project

Ageprocess:

Computes and prints the disease risk for each age band based on the dataset.

import org.apache.spark.sql.SparkSession

object ageprocess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val heart = spark.read.table("heart.heart")

    // Disease risk per age band: patients in the band / everyone in the band.
    val bands = Seq((29, 40), (41, 50), (51, 60), (61, 70), (71, 77))
    for ((lo, hi) <- bands) {
      val sick = heart.filter(s"age >= '$lo' and age <= '$hi'").filter("target = '1'").count()
      val all = heart.filter(s"age >= '$lo' and age <= '$hi'").count()
      println(s"Disease risk for ages $lo to $hi:")
      baifenshu((sick.toFloat / all).formatted("%.2f").toFloat)
    }
  }

  def baifenshu(x: Float): Unit = {
    val per = (x * 100).toInt
    println(per + "%")
  }
}

Cpprocess:

Selects age, chest pain type, and the disease label from the dataset and writes them to a new JSON file for later correlation analysis.

import org.apache.spark.sql.SparkSession

object cpprocess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val heart = spark.read.table("heart.heart")

    val cp = heart.select("age", "cp", "target")
    cp.write.mode("overwrite").json("/user/root/heart/age_cp_target")
  }
}

Exangprocess:

Selects the exercise-induced-angina and disease-status features from the dataset and writes them to a new JSON file for later correlation analysis.

import org.apache.spark.sql.SparkSession

object exangprocess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val heart = spark.read.table("heart.heart")

    val exang = heart.select("exang", "target")
    exang.write.mode("overwrite").json("/user/root/heart/exang_target")
  }
}

Thalach_target:

Selects maximum heart rate and disease status from the dataset and writes them to a new file for later correlation analysis.

import org.apache.spark.sql.SparkSession

object thalach_target {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val heart = spark.read.table("heart.heart")

    val thalach = heart.select("thalach", "target")
    thalach.write.mode("overwrite").json("/user/root/heart/thalach_target")
  }
}

Thalachprocess:

Splits maximum heart rate into bands and prints the head count and disease probability within each band.

import org.apache.spark.sql.SparkSession

object thalachprocess {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val heart = spark.read.table("heart.heart")

    // Disease probability and head count within each maximum-heart-rate band.
    val bands = Seq((71, 90), (91, 110), (111, 130), (131, 150), (151, 170), (171, 190), (191, 202))
    for ((lo, hi) <- bands) {
      val sick = heart.filter(s"thalach >= '$lo' and thalach <= '$hi'").filter("target = '1'").count()
      val all = heart.filter(s"thalach >= '$lo' and thalach <= '$hi'").count()
      println(s"Disease probability for max heart rate $lo to $hi bpm:")
      baifenshu((sick.toFloat / all).formatted("%.2f").toFloat)
      println(s"Head count for max heart rate $lo to $hi bpm:")
      println(all)
    }
  }

  def baifenshu(x: Float): Unit = {
    val per = (x * 100).toInt
    println(per + "%")
  }
}
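All the band statistics in these jobs follow the same shape: count the rows whose value lands in an inclusive [lo, hi] range. Stripped of Spark, the pattern reduces to the sketch below; the `binCounts` name and signature are illustrative, not from the repository:

```scala
// Count how many values fall into each inclusive [lo, hi] bin,
// mirroring the per-band range filters used in the Spark jobs.
def binCounts(values: Seq[Int], bins: Seq[(Int, Int)]): Seq[Long] =
  bins.map { case (lo, hi) => values.count(v => v >= lo && v <= hi).toLong }
```

Keeping the bin boundaries in one `Seq` makes gaps or overlaps between adjacent bands easy to spot, which is harder with a hand-written filter per band.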

IV. Data Preprocessing and Basic Analysis

HDFS database overview

HDFS file storage overview

/user/root

Results of submitting selected jobs to the cluster:

spark-submit --master spark://master:7077 --class test.ageprocess /opt/cardio.jar

spark-submit --master spark://master:7077 --class test.hobbys /opt/cardio.jar

spark-submit --master spark://master:7077 --class test.exam /opt/cardio.jar

spark-submit --master spark://master:7077 --class test.ap /opt/cardio.jar

Selected generated JSON files:

/user/root/cardio

/user/root/heart

V. Notes and Reflections on Modeling

We originally hoped to build a heart disease prediction system through modeling: input data -> compare and analyze against the model -> output the probability of cardiovascular disease.

After several weeks of thought and a deeper look at the technology, we concluded that training a model purely from these numbers, with our current knowledge and techniques, would not be scientifically sound. Genuine cardiovascular disease prediction needs far more training data and stricter, more accurate clinical criteria, and it cannot be done without experts from the medical field. We therefore changed our goal to cardiovascular analysis and visualization, shelving the prediction system for now and hoping to complete it in the future.

VI. Data Visualization

Components used

  • HTML

  • CSS

  • JavaScript

  • jQuery

  • Ajax

  • ECharts

Main display page

Prediction interaction page (demo)

Bonus BMI calculator feature (demo)