Kaggle Bosch Production Line Performance, No. 74 / top 6% (post-competition analysis)
Bosch Production Line Performance e-mail : [email protected]
Summary : the data has about 4000 variables; the fitted model uses only 50 of them and reaches the top 6%.
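Tools : R code. The parallel package's mclapply is used for speed. mclapply relies on forking, so it needs Linux (or another Unix-like system); elsewhere, replace mclapply with sapply.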
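A minimal sketch of the mclapply / sapply switch, not the author's actual code. The data frame `toy` and the helper `na_rate` are hypothetical names used only for illustration.

```r
library(parallel)

# Hypothetical per-column summary: fraction of missing values in a vector.
na_rate <- function(x) mean(is.na(x))

# Toy data frame standing in for a small slice of train_numeric.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 2), c = c(1, 2, 3))

# mclapply forks worker processes, which Windows does not support,
# so fall back to sapply when not on a Unix-like system.
if (.Platform$OS.type == "unix") {
  res <- unlist(mclapply(toy, na_rate, mc.cores = 2))
} else {
  res <- sapply(toy, na_rate)
}
print(res)
```

Both branches return a named numeric vector with one value per column, so the rest of a script does not need to know which branch ran.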
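ps :
About 90% of the data is NA.
With this many NAs, variable selection by AIC/BIC/lasso runs into the missing value problem, so XGB importance is used instead. The feature engineering refers to the kernel by Daniel FG (code --- Feature Engineering).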
Bosch Production Line Performance ( )
4000
Response | 0 | 1 |
---|---|---|
count | 1176868 | 6879 |
rate of Response 1 = 0.0058
Data provided by Kaggle :
data | size | n (rows) | p (columns) | R RAM |
---|---|---|---|---|
train_numeric | 2.1GB | ~1 million | 970 | 8.5 GB |
train_date | 2.9GB | ~1 million | 1157 | 10.2 GB |
train_categorical | 2.7GB | ~1 million | 2141 | |
test_numeric | 2.1GB | ~1 million | 969 | |
test_date | 2.9GB | ~1 million | 1157 | |
test_categorical | 2.7GB | ~1 million | 2141 | |
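Variables :
variable | description |
---|---|
Response | target, 0 : part passed quality control, 1 : part failed |
Id | part ID |
Lx_Sx_Fx | L : line, S : station, F : feature number |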
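For example, L3_S36_F3939 is feature number 3939, measured at station 36 on line 3. The same naming applies to the numeric, date and categorical data.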
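The Lx_Sx_Fx naming can be split mechanically into line, station and feature number. The following is a small sketch using base R only; `parse_feature` is a hypothetical helper, not part of the original code.

```r
# Split names such as "L3_S36_F3939" into line, station and feature number.
# Names that do not follow the pattern yield NA in all three fields.
parse_feature <- function(x) {
  m <- regmatches(x, regexec("^L([0-9]+)_S([0-9]+)_F([0-9]+)$", x))
  parts <- do.call(rbind, lapply(m, function(v) as.integer(v[2:4])))
  data.frame(name = x,
             line = parts[, 1],
             station = parts[, 2],
             feature = parts[, 3],
             stringsAsFactors = FALSE)
}

info <- parse_feature(c("L3_S36_F3939", "L0_S0_F0"))
print(info)
```

Having the line as its own column makes it easy to group columns by L0, L1, L2 and L3 later on.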
train_numeric :
Id | L0_S0_F0 | L0_S0_F2 | L0_S0_F4 | L0_S0_F6 | Response |
---|---|---|---|---|---|
11 | -0.055 | -0.086 | 0.294 | 0.330 | 0 |
13 | 0.003 | 0.019 | 0.294 | 0.312 | 0 |
14 | NA | NA | NA | NA | 0 |
16 | NA | NA | NA | NA | 0 |
18 | -0.016 | -0.041 | -0.179 | -0.179 | 0 |
train_date
Id | L0_S0_F0 | L0_S0_F2 | L0_S0_F4 | L0_S0_F6 |
---|---|---|---|---|
11 | 602.64 | 602.64 | 602.64 | 602.64 |
13 | 1331.66 | 1331.66 | 1331.66 | 1331.66 |
14 | NA | NA | NA | NA |
16 | NA | NA | NA | NA |
18 | 517.64 | 517.64 | 517.64 | 517.64 |
The evaluation metric is MCC (Matthews correlation coefficient).
date data
feature | |
---|---|
first | |
min | |
last | |
max | |
class.amount | |
na.amount | na |
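The per-row date features in the table above can be sketched as follows. The exact definitions are assumptions: `first` / `last` are taken as the first / last non-missing timestamp in column order, `class.amount` as the number of distinct non-missing timestamps, and `na.amount` as the number of NA values. `date_toy` and `row_features` are hypothetical names.

```r
# Toy date matrix: one row per part, one column per timestamp feature.
date_toy <- rbind(
  c(602.64, 602.64, NA, 603.10),
  c(NA, NA, NA, NA),
  c(517.64, NA, 517.70, 517.64)
)

# Summarise one row of timestamps (ASSUMED definitions, see text above).
row_features <- function(x) {
  ok <- x[!is.na(x)]
  if (length(ok) == 0) {
    # A part with no timestamps at all: only the NA count is defined.
    return(c(first = NA, min = NA, last = NA, max = NA,
             class.amount = 0, na.amount = length(x)))
  }
  c(first = ok[1], min = min(ok), last = ok[length(ok)], max = max(ok),
    class.amount = length(unique(ok)), na.amount = sum(is.na(x)))
}

# apply() returns one column per row of the input, so transpose the result.
fe1 <- t(apply(date_toy, 1, row_features))
print(fe1)
```

Running the same function on the subset of columns belonging to one line would give the per-line variants such as L0_first.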
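Each feature is computed over all columns and separately for lines L0, L1, L2 and L3, ex : all_first, L0_first, L1_first, L2_first, L3_first. These features are built for the roughly 1 million rows of data. With feature engineering 1, the kaggle rank was about 50%.
feature engineering 2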
feature | code | |
---|---|---|
next | | |
prev | | |
total | next+prev | |
same.time | total>0 | |
order.same.time | (cumsum(prev)+1) * same.time | |
group | | |
group.length | table(group) | |
cost.time | max-min | |
prev.cost.time | cost.time-c(NA,cost.time[1:(length(cost.time)-1)]) | |
next.cost.time | cost.time-c(cost.time[2:length(cost.time)],NA) | |
prev.na.amount | na.amount-c(NA,na.amount[1:(length(na.amount)-1)]) | difference in na count from the previous row |
next.na.amount | na.amount-c(na.amount[2:length(na.amount)],NA) | difference in na count from the next row |
prev.target | c(NA,target[1:(nrow(target)-1)]) | |
next.target | c(target[2:nrow(target)],NA) | |
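The lag and lead expressions in the table above can be wrapped in two small helpers. This is a sketch on toy data, assuming the rows are already sorted in the order the features are meant to follow; `lag1`, `lead1` and the data frame `fe` are hypothetical names, and `target` is treated as a plain vector here.

```r
# Toy per-part features, assumed to be sorted already.
fe <- data.frame(min = c(10, 10, 12, 15),
                 max = c(11, 13, 12, 18),
                 na.amount = c(3, 3, 5, 1),
                 target = c(0, 1, 0, 0))

lag1  <- function(x) c(NA, x[-length(x)])  # value from the previous row
lead1 <- function(x) c(x[-1], NA)          # value from the next row

fe$cost.time      <- fe$max - fe$min
fe$prev.cost.time <- fe$cost.time - lag1(fe$cost.time)
fe$next.cost.time <- fe$cost.time - lead1(fe$cost.time)
fe$prev.na.amount <- fe$na.amount - lag1(fe$na.amount)
fe$next.na.amount <- fe$na.amount - lead1(fe$na.amount)
fe$prev.target    <- lag1(fe$target)
fe$next.target    <- lead1(fe$target)
print(fe)
```

The first row has no previous row and the last row has no next row, so those entries are NA, which XGBoost can handle as missing values.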
L0~L3
train_numeric
Over the roughly 1 million rows, the overall rate of Response 1 is 0.0058, but among the IDs that have a value for L3_S32_F3850 it is 0.045.
train_numeric :
- | res1.per | var.name |
---|---|---|
1 | 0.0451 | L3_S32_F3850 |
2 | 0.0093 | L1_S24_F1768 |
3 | 0.0093 | L1_S24_F1763 |
. | ... | ... |
. | ... | ... |
968 | 0.0003 | L1_S25_F2512 |
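A ranking like the one above could be produced as sketched below. This rests on an assumption that is not confirmed by the text: that res1.per is the share of Response = 1 among the rows where the variable is not NA. The toy data frame and all names other than `res1.per` and `var.name` are hypothetical.

```r
# Toy stand-in for train_numeric: two feature columns plus Response.
toy <- data.frame(
  f1 = c(0.1, NA, 0.3, NA, 0.2, NA),
  f2 = c(NA, 0.5, 0.4, 0.1, 0.2, 0.9),
  Response = c(1, 0, 0, 0, 1, 0)
)

vars <- setdiff(names(toy), "Response")

# ASSUMPTION: res1.per = mean of Response over rows where the variable is observed.
res1.per <- sapply(vars, function(v) mean(toy$Response[!is.na(toy[[v]])]))

ranking <- data.frame(res1.per = res1.per, var.name = vars,
                      stringsAsFactors = FALSE)
ranking <- ranking[order(-ranking$res1.per), ]
rownames(ranking) <- NULL
print(ranking)
```

On the real data the same loop would run over the numeric columns, which is the kind of per-column job where mclapply helps.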
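Together with feature engineering 1 and feature engineering 2, 450 variables go into XGBoost. xgb.cv is used to find the best nrounds, and the model is then fitted with that best nrounds.
xgb.importance then ranks the variables and the top 50 are kept, so the fitted model uses 50 features.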
amount of var = 450
- | pred 0 | pred 1 |
---|---|---|
real 0 | 1176353 | 515 |
real 1 | 4216 | 2663 |
MCC = 0.568
amount of var = 50
- | pred 0 | pred 1 |
---|---|---|
real 0 | 1176304 | 564 |
real 1 | 4360 | 2519 |
MCC = 0.545
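The two MCC values can be checked directly from the confusion matrices, using the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A small sketch in base R; the function name `mcc` is a hypothetical helper.

```r
# Matthews correlation coefficient from the four confusion-matrix counts.
mcc <- function(tn, fp, fn, tp) {
  # Convert to double so the large products do not overflow integer range.
  tn <- as.numeric(tn); fp <- as.numeric(fp)
  fn <- as.numeric(fn); tp <- as.numeric(tp)
  num <- tp * tn - fp * fn
  den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (den == 0) return(0)  # common convention when a row or column is empty
  num / den
}

mcc_450 <- mcc(tn = 1176353, fp = 515, fn = 4216, tp = 2663)
mcc_50  <- mcc(tn = 1176304, fp = 564, fn = 4360, tp = 2519)
print(round(c(mcc_450, mcc_50), 3))  # 0.568 0.545
```

Dropping from 450 to 50 variables costs about 0.023 MCC on these counts.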
( 0 )( 1 )( )
xgb.cv is used to find the best nrounds. The table below shows the xgb.cv result for each nrounds.
nrounds | train-rmse | test-rmse |
---|---|---|
11 | 0.168337 | 0.168881 |
21 | 0.083219 | 0.085046 |
31 | 0.064824 | 0.067989 |
41 | 0.061830 | 0.065582 |
51 | 0.061191 | 0.065279 |
61 | 0.060756 | 0.065227 |
71 | 0.060327 | 0.065229 |
Best iteration : 67, train-rmse:0.060464 test-rmse:0.065220
At nround = 50 the test-rmse has essentially stopped improving, so nround = 50 is used to fit the model on train and predict test.
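One way to read the xgb.cv log above is to pick the smallest nrounds whose test-rmse is within a small tolerance of the best value in the log. This plateau rule and the tolerance `tol` are assumptions for illustration, not necessarily how nround = 50 was chosen; `cv_log` simply copies the logged rows.

```r
# Rows copied from the xgb.cv log shown above.
cv_log <- data.frame(
  nrounds    = c(11, 21, 31, 41, 51, 61, 71),
  train_rmse = c(0.168337, 0.083219, 0.064824, 0.061830,
                 0.061191, 0.060756, 0.060327),
  test_rmse  = c(0.168881, 0.085046, 0.067989, 0.065582,
                 0.065279, 0.065227, 0.065229)
)

tol  <- 1e-4                   # hypothetical tolerance on test-rmse
best <- min(cv_log$test_rmse)  # best test-rmse among the logged rows

# Smallest nrounds that is already within tol of the best test-rmse.
picked <- min(cv_log$nrounds[cv_log$test_rmse - best <= tol])
print(picked)  # 51
```

The train-rmse keeps falling after this point while the test-rmse does not, which is the usual sign that extra rounds only fit noise.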
imbalance evaluation --- MCC XGBoost evaluation
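The variables from feature engineering 1, feature engineering 2 and train_numeric are ranked with xgb.importance and the top 50 features are kept. XGBoost is trained on these features. Because the data is imbalanced, the evaluation is MCC instead of rmse, with imbalance rate = 0.25.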
Fitted model top 6% rank MCC 2 ( 0.18 -> 0.46 ) feature
**L3_S32_F3850** | **L1_S24_F1723** | **L3_S33_F3859** | L3_S33_F3855 | L1_S24_F1846 |
L3_S33_F3865 | L1_S24_F1632 | L3_S33_F3857 | L3_S38_F3956 | L1_S24_F1498 |
L1_S24_F1604 | L3_S41_F4014 | L1_S24_F1695 | L3_S38_F3952 | L3_S33_F3873 |
L1_S24_F1844 | L3_S38_F3960 | L2_S26_F3036 | L2_S26_F3040 | L2_S26_F3047 |
L2_S26_F3073 | L1_S24_F1672 | L1_S24_F1609 | L1_S24_F1685 | |
**all_next** | **next.cost.time** | **next.target** | **L0_first** |
**all_prev** | **prev.target** | **group.amount** | next.na.amount |
all_na.amount | L3_first | total | cost.time |
L2_first | L3_na.amount | prev.cost.time | group |
L3_last | prev.na.amount | L3_min | L0_min |
all_first | all_class.amount | L1_first | L3_max |
order.same.time | L0_last | | |
feature plot :
ML 1. 2. ( ) 3. model : type 1 error vs type 2 error 4. data / model, 1 million rows of data, data vs ML, DATA/ 5. DL/ensemble 6.