拓端tecdat | Predicting customer churn on telecom churn data in R with a k-nearest-neighbors (KNN) model

数字踏月鹤

Original link: http://tecdat.cn/?p=5521

 

Data background

A telephone company wants to determine which customer characteristics help predict churn, that is, which customers will leave its service.

The data set is Churn. The fields are as follows:

 

state: discrete
account length: continuous
area code: continuous
phone number: discrete
international plan: discrete
voice mail plan: discrete
number vmail messages: continuous
total day minutes: continuous
total day calls: continuous
total day charge: continuous
total eve minutes: continuous
total eve calls: continuous
total eve charge: continuous
total night minutes: continuous
total night calls: continuous
total night charge: continuous
total intl minutes: continuous
total intl calls: continuous
total intl charge: continuous
number customer service calls: continuous
churn: discrete

Data Preparation and Exploration 

 

View the data overview:

##      state      account.length    area.code      phone.number
##  WV     : 158   Min.   :  1.0   Min.   :408.0   327-1058:   1
##  MN     : 125   1st Qu.: 73.0   1st Qu.:408.0   327-1319:   1
##  AL     : 124   Median :100.0   Median :415.0   327-2040:   1
##  ID     : 119   Mean   :100.3   Mean   :436.9   327-2475:   1
##  VA     : 118   3rd Qu.:127.0   3rd Qu.:415.0   327-3053:   1
##  OH     : 116   Max.   :243.0   Max.   :510.0   327-3587:   1
##  (Other):4240                                   (Other) :4994
##  international.plan voice.mail.plan number.vmail.messages
##  no :4527           no :3677        Min.   : 0.000
##  yes: 473           yes:1323        1st Qu.: 0.000
##                                     Median : 0.000
##                                     Mean   : 7.755
##                                     3rd Qu.:17.000
##                                     Max.   :52.000
##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0
##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4
##  Median :180.1     Median :100     Median :30.62    Median :201.0
##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6
##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1
##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7
##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00
##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00
##  Median :100.0   Median :17.09    Median :200.4       Median :100.00
##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92
##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00
##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 7.510     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300
##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780
##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400
##  number.customer.service.calls    churn
##  Min.   :0.00                  False.:4293
##  1st Qu.:1.00                  True. : 707
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00

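The overview shows no missing values. It also shows that the phone number and area code carry no predictive value, so both variables can be dropped.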
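A minimal sketch of this check in R, assuming a data frame named `churn` that uses the column names from the summary. The four-row data frame is made up for illustration and is not the article's data; the variable names `na_counts` and `churn_clean` are hypothetical.

```r
# Toy stand-in for the churn data (made-up rows, not the real data set)
churn <- data.frame(
  state          = factor(c("WV", "MN", "AL", "OH")),
  account.length = c(128, 107, 137, 84),
  area.code      = c(415, 415, 408, 510),
  phone.number   = c("327-1058", "327-1319", "327-2040", "327-2475"),
  churn          = factor(c("False.", "False.", "True.", "False."))
)

# Per-column overview, the same kind of call that produces the output above
print(summary(churn))

# Count missing values per column; all zeros means nothing is missing
na_counts <- colSums(is.na(churn))
print(na_counts)

# Drop the identifier-like columns that carry no predictive value
churn_clean <- churn[, !(names(churn) %in% c("phone.number", "area.code"))]
print(names(churn_clean))
```

Dropping by name rather than by position keeps the step valid if the column order changes.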

 

Examine the variables graphically

 

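The plots show that far more samples have churn = no than churn = yes, so non-churners make up the large majority of the data (4293 of 5000 customers, about 86%, according to the summary).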
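The class imbalance can be quantified in a few lines of R. This sketch types in the two counts from the summary output rather than loading the data; with a real data frame one would typically obtain them from `table()` on the churn column. The variable names are hypothetical.

```r
# Class counts copied from the summary output (False. = stayed, True. = churned)
churn_counts <- c("False." = 4293, "True." = 707)

# Share of each class among the 5000 customers
churn_share <- churn_counts / sum(churn_counts)
print(round(churn_share, 3))

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy <- max(churn_share)
print(baseline_accuracy)
```

The majority-class share is a useful reference point when reading the accuracy of any classifier trained on imbalanced data.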

 

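The histograms show that, apart from emailcode (apparently number.vmail.messages) and area.code, the numeric variables are approximately normally distributed.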
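A rough way to back up such a reading of the histograms is to compare each variable's mean with its median. The sketch below uses simulated stand-ins, not the article's data: one bell-shaped variable and one zero-inflated variable that imitates number.vmail.messages. The standard deviation, the 0.73 share of zeros, the Poisson rate and the helper `draw_histograms` are all assumptions made for illustration.

```r
set.seed(1)
n <- 500

# Simulated stand-ins (not the article's data)
toy <- data.frame(
  total.day.minutes     = rnorm(n, mean = 180, sd = 54),
  number.vmail.messages = ifelse(runif(n) < 0.73, 0, rpois(n, lambda = 29))
)

# Pick out the numeric columns
numeric_cols <- names(toy)[sapply(toy, is.numeric)]

# Hypothetical helper: draw one histogram per numeric column, side by side
draw_histograms <- function(df, cols) {
  old_par <- par(mfrow = c(1, length(cols)))
  on.exit(par(old_par))
  for (col in cols) hist(df[[col]], main = col, xlab = col)
  invisible(NULL)
}

# Quick numeric check: a mean well above the median signals right skew
mean_minus_median <- sapply(toy[numeric_cols], function(x) mean(x) - median(x))
print(round(mean_minus_median, 2))
```

Calling `draw_histograms(toy, numeric_cols)` draws the panels. In the article's own summary, number.vmail.messages has a mean of 7.755 and a median of 0, which is the same pattern.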

##  account.length    area.code     number.vmail.messages total.day.minutes
##  Min.   :  1.0   Min.   :408.0   Min.   : 0.000        Min.   :  0.0
##  1st Qu.: 73.0   1st Qu.:408.0   1st Qu.: 0.000        1st Qu.:143.7
##  Median :100.0   Median :415.0   Median : 0.000        Median :180.1
##  Mean   :100.3   Mean   :436.9   Mean   : 7.755        Mean   :180.3
##  3rd Qu.:127.0   3rd Qu.:415.0   3rd Qu.:17.000        3rd Qu.:216.2
##  Max.   :243.0   Max.   :510.0   Max.   :52.000        Max.   :351.5
##  total.day.calls total.day.charge total.eve.minutes total.eve.calls
##  Min.   :  0     Min.   : 0.00    Min.   :  0.0     Min.   :  0.0
##  1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0
##  Median :100     Median :30.62    Median :201.0     Median :100.0
##  Mean   :100     Mean   :30.65    Mean   :200.6     Mean   :100.2
##  3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0
##  Max.   :165     Max.   :59.76    Max.   :363.7     Max.   :170.0
##  total.eve.charge total.night.minutes total.night.calls total.night.charge
##  Min.   : 0.00    Min.   :  0.0       Min.   :  0.00    Min.   : 0.000
##  1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510
##  Median :17.09    Median :200.4       Median :100.00    Median : 9.020
##  Mean   :17.05    Mean   :200.4       Mean   : 99.92    Mean   : 9.018
##  3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560
##  Max.   :30.91    Max.   :395.0       Max.   :175.00    Max.   :17.770
##  total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300
##  Median :10.30      Median : 4.000   Median :2.780
##  Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :20.00      Max.   :20.000   Max.   :5.400
##  number.customer.service.calls
##  Min.   :0.00
##  1st Qu.:1.00
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00

Relationships between variables

The result shows a significant positive linear relationship between the two variables.


Using the statistics node, report

##                               account.length    area.code
## account.length                 1.0000000000 -0.018054187
## area.code                     -0.0180541874  1.000000000
## number.vmail.messages         -0.0145746663 -0.003398983
## total.day.minutes             -0.0010174908 -0.019118245
## total.day.calls                0.0282402279 -0.019313854
## total.day.charge              -0.0010191980 -0.019119256
## total.eve.minutes             -0.0095913331  0.007097877
## total.eve.calls                0.0091425790 -0.012299947
## total.eve.charge              -0.0095873958  0.007114130
## total.night.minutes            0.0006679112  0.002083626
## total.night.calls             -0.0078254785  0.014656846
## total.night.charge             0.0006558937  0.002070264
## total.intl.minutes             0.0012908394 -0.004153729
## total.intl.calls               0.0142772733 -0.013623309
## total.intl.charge              0.0012918112 -0.004219099
## number.customer.service.calls -0.0014447918  0.020920513
##                               number.vmail.messages total.day.minutes
## account.length                -0.0145746663 -0.001017491
## area.code                     -0.0033989831 -0.019118245
## number.vmail.messages          1.0000000000  0.005381376
## total.day.minutes              0.0053813760  1.000000000
## total.day.calls                0.0008831280  0.001935149
## total.day.charge               0.0053767959  0.999999951
## total.eve.minutes              0.0194901208 -0.010750427
## total.eve.calls               -0.0039543728  0.008128130
## total.eve.charge               0.0194959757 -0.010760022
## total.night.minutes            0.0055413838  0.011798660
## total.night.calls              0.0026762202  0.004236100
## total.night.charge             0.0055349281  0.011782533
## total.intl.minutes             0.0024627018 -0.019485746
## total.intl.calls               0.0001243302 -0.001303123
## total.intl.charge              0.0025051773 -0.019414797
## number.customer.service.calls -0.0070856427  0.002732576
##                               total.day.calls total.day.charge
## account.length                 0.0282402279 -0.001019198
## area.code                     -0.0193138545 -0.019119256
## number.vmail.messages          0.0008831280  0.005376796
## total.day.minutes              0.0019351487  0.999999951
## total.day.calls                1.0000000000  0.001935884
## total.day.charge               0.0019358844  1.000000000
## total.eve.minutes             -0.0006994115 -0.010747297
## total.eve.calls                0.0037541787  0.008129319
## total.eve.charge              -0.0006952217 -0.010756893
## total.night.minutes            0.0028044650  0.011801434
## total.night.calls             -0.0083083467  0.004234934
## total.night.charge             0.0028018169  0.011785301
## total.intl.minutes             0.0130972198 -0.019489700
## total.intl.calls               0.0108928533 -0.001306635
## total.intl.charge              0.0131613976 -0.019418755
## number.customer.service.calls -0.0107394951  0.002726370
##                               total.eve.minutes total.eve.calls
## account.length                -0.0095913331  0.009142579
## area.code                      0.0070978766 -0.012299947
## number.vmail.messages          0.0194901208 -0.003954373
## total.day.minutes             -0.0107504274  0.008128130
## total.day.calls               -0.0006994115  0.003754179
## total.day.charge              -0.0107472968  0.008129319
## total.eve.minutes              1.0000000000  0.002763019
## total.eve.calls                0.0027630194  1.000000000
## total.eve.charge               0.9999997749  0.002778097
## total.night.minutes           -0.0166391160  0.001781411
## total.night.calls              0.0134202163 -0.013682341
## total.night.charge            -0.0166420421  0.001799380
## total.intl.minutes             0.0001365487 -0.007458458
## total.intl.calls               0.0083881559  0.005574500
## total.intl.charge              0.0001593155 -0.007507151
## number.customer.service.calls -0.0138234228  0.006234831
##                               total.eve.charge total.night.minutes
## account.length                -0.0095873958  0.0006679112
## area.code                      0.0071141298  0.0020836263
## number.vmail.messages          0.0194959757  0.0055413838
## total.day.minutes             -0.0107600217  0.0117986600
## total.day.calls               -0.0006952217  0.0028044650
## total.day.charge              -0.0107568931  0.0118014339
## total.eve.minutes              0.9999997749 -0.0166391160
## total.eve.calls                0.0027780971  0.0017814106
## total.eve.charge               1.0000000000 -0.0166489191
## total.night.minutes           -0.0166489191  1.0000000000
## total.night.calls              0.0134220174  0.0269718182
## total.night.charge            -0.0166518367  0.9999992072
## total.intl.minutes             0.0001320238 -0.0067209669
## total.intl.calls               0.0083930603 -0.0172140162
## total.intl.charge              0.0001547783 -0.0066545873
## number.customer.service.calls -0.0138363623 -0.0085325365
 
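Keeping highly correlated variables in the model can cause multicollinearity, so one variable from each highly correlated pair should be removed. For example, total.day.minutes and total.day.charge have a correlation of 0.999999951.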
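One way to automate this removal in base R is to scan the upper triangle of the correlation matrix for pairs above a cutoff. This is a sketch on simulated data, not the article's code: the 0.9 cutoff, the per-minute rate of 0.17 and the variable names `high`, `to_drop` and `reduced` are assumptions chosen for illustration.

```r
set.seed(42)
n <- 200

# Simulated data: charge is an exact multiple of minutes, calls are independent
minutes <- rnorm(n, mean = 180, sd = 54)
toy <- data.frame(
  total.day.minutes = minutes,
  total.day.charge  = 0.17 * minutes,               # assumed flat per-minute rate
  total.day.calls   = rnorm(n, mean = 100, sd = 20)
)

cor_mat <- cor(toy)
print(round(cor_mat, 3))

# Look only at the upper triangle so that each pair is reported once
cutoff <- 0.9
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)

# Drop the second variable of every highly correlated pair
to_drop <- unique(colnames(cor_mat)[high[, "col"]])
reduced <- toy[, !(names(toy) %in% to_drop), drop = FALSE]
print(names(reduced))
```

Which member of a pair to keep is a modelling choice; here the column that appears later in the data frame is the one dropped.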

Data Manipulation

 
The result shows some correlation between total.day.calls and total.day.charge. In particular, the relationship is negative among customers with voice.mail.plan = no.

 

Discretize (make categorical) a relevant numeric variable

 

 

 

Discretize the variable.
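In R, discretization is usually done with `cut()`. The sketch below bins a handful of total.day.minutes values into short, medium and long, then cross-tabulates the bins against churn. The break points (120 and 240), the six input values and the churn labels are hypothetical; the article does not state which cut points it used.

```r
# A few example values of total.day.minutes (illustrative only)
minutes <- c(0, 95.5, 143.7, 180.1, 216.2, 351.5)

# Made-up churn labels for the same six customers
churn <- factor(c("no", "no", "no", "no", "yes", "yes"))

# Hypothetical break points; intervals are (-Inf,120], (120,240], (240,Inf]
minutes_band <- cut(
  minutes,
  breaks = c(-Inf, 120, 240, Inf),
  labels = c("short", "medium", "long")
)

# Churn overlay: counts and row-wise shares of churn within each band
overlay <- table(minutes_band, churn)
print(overlay)
print(prop.table(overlay, margin = 1))
```

The row-wise shares are what a churn overlay on a bar chart displays: the proportion of churners inside each band.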

 

construct a distribution of the variable with a churn overlay

construct a histogram of the variable with a churn overlay

 


Find a pair of numeric variables which are interesting with respect to churn.

 
The result shows some correlation between total.day.calls and total.day.charge, particularly among customers with churn = no.

Model Building
 

##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    0.3082150  0.0735760   4.189 2.85e-05 ***
## stateAL                        0.0151188  0.0462343   0.327 0.743680
## stateAR                        0.0894792  0.0490897   1.823 0.068399 .
## stateAZ                        0.0329566  0.0494195   0.667 0.504883
## stateCA                        0.1951511  0.0567439   3.439 0.000588 ***
## international.plan yes         0.3059341  0.0151677  20.170  < 2e-16 ***
## voice.mail.plan yes           -0.1375056  0.0337533  -4.074 4.70e-05 ***
## number.vmail.messages          0.0017068  0.0010988   1.553 0.120402
## total.day.minutes              0.3796323  0.2629027   1.444 0.148802
## total.day.calls                0.0002191  0.0002235   0.981 0.326781
## total.day.charge              -2.2207671  1.5464583  -1.436 0.151056
## total.eve.minutes              0.0288233  0.1307496   0.220 0.825533
## total.eve.calls               -0.0001585  0.0002238  -0.708 0.478915
## total.eve.charge              -0.3316041  1.5382391  -0.216 0.829329
## total.night.minutes            0.0083224  0.0695916   0.120 0.904814
## total.night.calls             -0.0001824  0.0002225  -0.820 0.412290
## total.night.charge            -0.1760782  1.5464674  -0.114 0.909355
## total.intl.minutes            -0.0104679  0.4192270  -0.025 0.980080
## total.intl.calls              -0.0063448  0.0018062  -3.513 0.000447 ***
## total.intl.charge              0.0676460  1.5528267   0.044 0.965254
## number.customer.service.calls  0.0566474  0.0033945  16.688  < 2e-16 ***
## total.day.minutes1medium       0.0502681  0.0160228   3.137 0.001715 **
## total.day.minutes1short        0.2404020  0.0322293   7.459 1.02e-13 ***

 

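The results show that state (for example stateCA), total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. In the same table, international.plan and voice.mail.plan are also significant at the 0.001 level.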
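The coefficient table has a "t value" column, which suggests a linear model fitted with `lm()` on a 0/1 churn indicator, although the article does not show the call. The sketch below fits such a model on simulated data and, for comparison, a logistic regression with `glm()`, whose summary reports z values instead. The data, the coefficients used to simulate churn and the choice of predictors are all made up for illustration.

```r
set.seed(7)
n <- 400

# Simulated predictors (not the article's data)
toy <- data.frame(
  international.plan = factor(sample(c("no", "yes"), n, replace = TRUE,
                                     prob = c(0.9, 0.1))),
  number.customer.service.calls = rpois(n, lambda = 1.5)
)

# Simulated 0/1 churn whose probability rises with both predictors
p <- plogis(-2.5 + 1.5 * (toy$international.plan == "yes") +
              0.4 * toy$number.customer.service.calls)
toy$churn <- rbinom(n, size = 1, prob = p)

# Linear probability model: its coefficient table has t values
fit_lm <- lm(churn ~ international.plan + number.customer.service.calls,
             data = toy)
lm_table <- coef(summary(fit_lm))
print(lm_table)

# Logistic regression, a common alternative for a binary outcome (z values)
fit_glm <- glm(churn ~ international.plan + number.customer.service.calls,
               data = toy, family = binomial)
glm_table <- coef(summary(fit_glm))
print(glm_table)
```

Both tables can be read the same way: a small p-value in the last column marks a predictor whose effect is unlikely to be zero.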

Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn

##         Direction.2005
## knn.pred   1   2
##        1 760  97
##        2 100  43
[1] 0.803
 
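A confusion matrix is a table used to evaluate classifiers in supervised learning; in unsupervised learning it is usually called a matching matrix. In the tables shown here, each row is a class predicted by KNN (knn.pred) and each column is an actual class, so the diagonal cells count the correct predictions and the accuracy is the diagonal sum divided by the total.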
##         Direction.2005
## knn.pred   1   2
##        1 827 104
##        2  33  36
[1] 0.863
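The two accuracy figures can be reproduced directly from the confusion matrices. The sketch below re-enters the reported counts by hand, with rows as predictions and columns as actual classes; the function name `accuracy` and the object names are hypothetical.

```r
# Counts copied from the two tables; matrix() fills column by column
conf_1 <- matrix(c(760, 100, 97, 43), nrow = 2,
                 dimnames = list(predicted = c("1", "2"), actual = c("1", "2")))
conf_2 <- matrix(c(827, 33, 104, 36), nrow = 2,
                 dimnames = list(predicted = c("1", "2"), actual = c("1", "2")))

# Accuracy = correct predictions (the diagonal) divided by all predictions
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

acc_1 <- accuracy(conf_1)
acc_2 <- accuracy(conf_2)
print(c(acc_1, acc_2))

# Share of the larger actual class, i.e. the accuracy of always predicting it
baseline <- max(colSums(conf_2)) / sum(conf_2)
print(baseline)
```

Comparing each accuracy with the majority-class share (860 of 1000 here) gives a reference point for how much the model adds.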

 

On the test set, the accuracy reaches about 86% (0.863).
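For readers who want to reproduce the workflow, here is a minimal KNN sketch using `knn()` from the `class` package. It is not the article's code: the data are simulated, and the value k = 5, the 200/100 split and the two predictors are assumptions. Predictors are standardized with the training set's means and standard deviations because KNN relies on distances.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(123)
n <- 300

# Simulated predictors and churn labels (made up, not the article's data)
toy <- data.frame(
  total.day.minutes             = rnorm(n, mean = 180, sd = 54),
  number.customer.service.calls = rpois(n, lambda = 1.5)
)
score <- 0.02 * (toy$total.day.minutes - 180) +
  0.8 * (toy$number.customer.service.calls - 1.5)
churn <- factor(ifelse(score + rnorm(n) > 1.5, "yes", "no"))

# Random split: 200 training rows, 100 test rows
train_idx <- sample(n, size = 200)

# Standardize using training statistics only, then apply them to the test rows
x_train <- scale(toy[train_idx, ])
x_test  <- scale(toy[-train_idx, ],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

# Classify each test row by majority vote among its 5 nearest training rows
knn_pred <- knn(train = x_train, test = x_test, cl = churn[train_idx], k = 5)

# Confusion matrix (rows = predicted, columns = actual) and test accuracy
conf <- table(knn_pred, actual = churn[-train_idx])
test_accuracy <- mean(knn_pred == churn[-train_idx])
print(conf)
print(test_accuracy)
```

In practice k is usually chosen by comparing accuracy on held-out data across several candidate values.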

 

Findings

 

There is some correlation between total.day.calls and total.day.charge, particularly among customers with churn = no. The regression shows that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. Finally, the KNN model reaches an accuracy of about 80% on the training set and about 86% on the test set, which indicates that the model predicts well.
 
