The model_selection module

지지킴 2023. 9. 3. 15:45

< Table of Contents >

1. Splitting training and test data
1-1. train_test_split()

2. Cross-validation: splitting and evaluation
2-1. KFold
2-2. StratifiedKFold
2-3. cross_val_score

3. Cross-validation and hyperparameter tuning in one step
3-1. GridSearchCV

1. Splitting training and test data

1-1. train_test_split()

Here we use the iris dataset. It can be loaded with load_iris, which returns a dictionary-like object (a Bunch).
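To make the "dictionary-like" description concrete, here is a minimal sketch that inspects the object returned by load_iris. The variable name same_array is only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The Bunch supports both key access and attribute access.
print(list(iris.keys()))
print(iris.data.shape)     # feature matrix: 150 samples x 4 features
print(iris.target.shape)   # one integer label per sample
print(iris.target_names)   # class names for labels 0, 1, 2

# Key access and attribute access return the same underlying array.
same_array = iris["data"] is iris.data
print(same_array)
```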

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.3, random_state=121)

# Create the classifier
dt_clf = DecisionTreeClassifier()

# Train (fit)
dt_clf.fit(X_train, y_train)

# Predict
pred = dt_clf.predict(X_test)

print('Prediction accuracy:  {:.3f}'.format(accuracy_score(y_test, pred)))


>>Result
Prediction accuracy:  0.956

If you also pass stratify=iris.target to train_test_split(), the train and test sets are split so that each keeps the class proportions of iris.target.

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.3, random_state=121, stratify=iris.target)
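A quick way to see what stratify changes is to count the labels on each side of the split. The sketch below is an illustration rather than part of the original walkthrough; it uses np.bincount to count samples per class, and the variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Plain random split: class counts in the test set depend on the shuffle.
_, _, _, y_test_plain = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=121)

# Stratified split: the class proportions of iris.target are preserved.
_, _, y_train_strat, y_test_strat = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=121,
    stratify=iris.target)

print('plain test counts:      ', np.bincount(y_test_plain, minlength=3))
print('stratified test counts: ', np.bincount(y_test_strat, minlength=3))
print('stratified train counts:', np.bincount(y_train_strat, minlength=3))
```

Since iris has 50 samples per class, the stratified split should put the same number of each class in the test set, while the plain split is free to drift.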

2. Cross-validation

2-1. KFold

from sklearn.model_selection import KFold
import numpy as np

n_iter = 0

# Load the data
iris = load_iris()
features = iris.data
label = iris.target

# Classifier
dt_clf = DecisionTreeClassifier(random_state=156)
kfold = KFold(n_splits=5)  # split into 5 folds
cv_accuracy = []

# Each iteration of the for loop below yields the train and test indices.
for train_index, test_index in kfold.split(features):
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]

    # Train
    dt_clf.fit(X_train, y_train)
    # Predict
    pred = dt_clf.predict(X_test)
    # Evaluate
    accuracy = round(accuracy_score(y_test, pred), 3)
    cv_accuracy.append(accuracy)
    n_iter += 1

    print('{0} accuracy:{1}'.format(n_iter, accuracy))
    print('-'*20)

# The final accuracy is the mean of cv_accuracy, the list holding the accuracy of every fold.
print('Final accuracy: {:.3f}'.format(np.mean(cv_accuracy)))

>>>Result
1 accuracy:1.0
--------------------
2 accuracy:0.967
--------------------
3 accuracy:0.867
--------------------
4 accuracy:0.933
--------------------
5 accuracy:0.733
--------------------
Final accuracy: 0.900

** Output of for train_index, test_index in kfold.split(features):

Since n_splits=5, validation runs five times in total.

# First train/test set
[ 30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65
  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83
  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101
102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
138 139 140 141 142 143 144 145 146 147 148 149] # train_index
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29] # test_index

# Second train/test set
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  60  61  62  63  64  65
  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83
  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101
102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
138 139 140 141 142 143 144 145 146 147 148 149]
[30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59]

# Third train/test set
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  90  91  92  93  94  95  96  97  98  99 100 101
102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
138 139 140 141 142 143 144 145 146 147 148 149]
[60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
84 85 86 87 88 89]

# Fourth train/test set
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
138 139 140 141 142 143 144 145 146 147 148 149]
[ 90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
108 109 110 111 112 113 114 115 116 117 118 119]

# Fifth train/test set
[  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
108 109 110 111 112 113 114 115 116 117 118 119]
[120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
138 139 140 141 142 143 144 145 146 147 148 149]
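The index listing above also helps explain the uneven fold scores. load_iris returns the samples sorted by class (50 of each), and KFold without shuffling cuts contiguous blocks, so some test folds contain classes that are rare or missing in the matching training data. Below is a minimal sketch that counts the labels in each test fold; the shuffle=True variant and its random_state=0 are illustrative choices, not something from the original run.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

iris = load_iris()
features, label = iris.data, iris.target

# Unshuffled folds: each test fold is a contiguous block of indices.
unshuffled_counts = [
    np.bincount(label[test_index], minlength=3).tolist()
    for _, test_index in KFold(n_splits=5).split(features)
]
print(unshuffled_counts)

# Shuffled folds mix the classes, though not in exact proportion.
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_counts = [
    np.bincount(label[test_index], minlength=3).tolist()
    for _, test_index in shuffled.split(features)
]
print(shuffled_counts)
```

In the unshuffled case the fifth test fold holds only class 2, while its training data has just 20 samples of that class, which is consistent with the low score on that fold.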

2-2. StratifiedKFold

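Splits the training and validation data according to the distribution of the label data (target), i.e. the class proportions → every fold gets an even mix of classes.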

from sklearn.model_selection import StratifiedKFold
import numpy as np

n_iter = 0
iris = load_iris()
features = iris.data
label = iris.target

dt_clf = DecisionTreeClassifier(random_state=156)

stf_kfold = StratifiedKFold(n_splits=3)
cv_accuracy = []

# Pass label as well so the train/validation split follows the class proportions.
for train_index, test_index in stf_kfold.split(features, label):
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = label[train_index], label[test_index]

    dt_clf.fit(X_train, y_train)
    pred = dt_clf.predict(X_test)
    accuracy = round(accuracy_score(y_test, pred), 3)
    cv_accuracy.append(accuracy)
    n_iter += 1

    print('{0} accuracy:{1}'.format(n_iter, accuracy))
    print('-'*20)

print('Final accuracy: {:.3f}'.format(np.mean(cv_accuracy)))


>>>Result
1 accuracy:0.98
--------------------
2 accuracy:0.94
--------------------
3 accuracy:0.98
--------------------
Final accuracy: 0.967
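To check the "even mix of classes" claim directly, the sketch below counts the labels in each StratifiedKFold test fold. It is a small sanity check added for illustration, not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
features, label = iris.data, iris.target

fold_counts = []
for train_index, test_index in StratifiedKFold(n_splits=3).split(features, label):
    # Count how many samples of each class fall into this test fold.
    fold_counts.append(np.bincount(label[test_index], minlength=3).tolist())

print(fold_counts)
```

With 50 samples per class and three folds, each class can only contribute 16 or 17 samples to a fold, so the counts should be nearly identical across folds.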

2-3. cross_val_score

Runs the whole KFold (or StratifiedKFold) procedure in a single call.
Returns the cross-validation scores, one per fold.

from sklearn.model_selection import cross_val_score

iris_data = load_iris()
dt_clf = DecisionTreeClassifier(random_state=156)

data = iris_data.data
label = iris_data.target


# Pass the classifier, features, labels, scoring method and number of folds
scores = cross_val_score(dt_clf, data, label, scoring='accuracy', cv=3)

print('Accuracy per fold:', scores)
print('Mean validation accuracy:', round(np.mean(scores), 4))


>>>Result
Accuracy per fold: [0.98 0.94 0.98]
Mean validation accuracy: 0.9667
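When cv is an integer and the estimator is a classifier, cross_val_score uses StratifiedKFold internally, which is why these scores match the ones from section 2-2. A minimal sketch checking that equivalence (the names scores_int and scores_skf are just for this example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris_data = load_iris()
dt_clf = DecisionTreeClassifier(random_state=156)

# Integer cv: scikit-learn picks the splitter for us.
scores_int = cross_val_score(dt_clf, iris_data.data, iris_data.target,
                             scoring='accuracy', cv=3)

# Explicit splitter: the same stratified 3-fold split, spelled out.
scores_skf = cross_val_score(dt_clf, iris_data.data, iris_data.target,
                             scoring='accuracy',
                             cv=StratifiedKFold(n_splits=3))

print(scores_int)
print(scores_skf)
```

Passing a splitter object explicitly is also the way to go if you want shuffling or a different fold strategy.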

3. GridSearchCV

Cross-validation & tuning for the best parameters, all in one step!!

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score


iris_data = load_iris()

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target,
    test_size=0.2, random_state=121)

# Create the classifier
dtree = DecisionTreeClassifier()

# Parameters to try
grid_parameters = {'max_depth': [1, 2, 3],
                   'min_samples_split': [2, 3]}

# Create the GridSearchCV object
grid_dtree = GridSearchCV(dtree, param_grid=grid_parameters, cv=3, refit=True)

# Train
grid_dtree.fit(X_train, y_train)

The results of fit are stored in cv_results_ (turning it into a DataFrame makes it easier to read).

import pandas as pd

scores_df = pd.DataFrame(grid_dtree.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score',
           'split0_test_score', 'split1_test_score', 'split2_test_score']]

>>Result
max_depth: 3 & min_samples_split: 2 has rank 1 = the best parameters

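The best result of the search (obtained by cross-validating on the training data) can be read from the following attributes.
- best_params_: the hyperparameter values that gave the best performance
- best_score_: the mean cross-validation score (accuracy here) obtained with best_params_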

print('GridSearchCV best parameters:', grid_dtree.best_params_)
print('GridSearchCV best accuracy:{:.3f}'.format(grid_dtree.best_score_))


>>>Result
GridSearchCV best parameters: {'max_depth': 3, 'min_samples_split': 2}
GridSearchCV best accuracy:0.975 # best mean cross-validation accuracy on the training data
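best_score_ is a mean over the cross-validation folds, not a score on a separate dataset. The sketch below reproduces the grid search by hand with ParameterGrid and cross_val_score to make that visible. The fixed random_state=0 on the tree is an assumption added here so the two runs are comparable; the original code leaves it unset, and manual_means is a made-up name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, ParameterGrid,
                                     cross_val_score, train_test_split)
from sklearn.tree import DecisionTreeClassifier

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=121)

grid_parameters = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}

# Manual search: mean 3-fold accuracy for every parameter combination.
manual_means = []
for params in ParameterGrid(grid_parameters):
    model = DecisionTreeClassifier(random_state=0, **params)
    manual_means.append(cross_val_score(model, X_train, y_train, cv=3).mean())

# The same search through GridSearchCV.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid=grid_parameters, cv=3)
grid.fit(X_train, y_train)

print(np.round(manual_means, 4))
print(np.round(grid.cv_results_['mean_test_score'], 4))
print(np.isclose(grid.best_score_, max(manual_means)))
```

The two printed arrays should agree, and best_score_ is simply the largest entry of mean_test_score.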

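refit = True?
- If refit=True is set when creating the GridSearchCV object, the model is retrained on the whole training set with the best parameter combination found, and the result is stored in best_estimator_.
- Above, max_depth = 3 and min_samples_split = 2 were the best parameters, so a tree fitted with those parameters is what best_estimator_ holds.
- Using best_estimator_ (trained with max_depth: 3, min_samples_split: 2, the parameters found best on X_train, y_train), we now call predict() on the test data X_test and compute the accuracy from y_test and the predictions.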

estimator = grid_dtree.best_estimator_  # the model trained with the best parameters

# Predictions with the best parameters.
pred = estimator.predict(X_test)
print('Test set accuracy: {:.4f}'.format(accuracy_score(y_test, pred)))

>>Result
Test set accuracy: 0.9667 # performance on the test data
# accuracy with max_depth: 3 & min_samples_split: 2 (= the best parameters, best_params_)
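Because refit=True, the GridSearchCV object itself can also be used for prediction: its predict and score methods delegate to best_estimator_. A minimal sketch, again with an assumed random_state=0 on the tree for repeatability and illustrative variable names:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=121)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'max_depth': [1, 2, 3],
                                'min_samples_split': [2, 3]},
                    cv=3, refit=True)
grid.fit(X_train, y_train)

# Calling predict on the search object uses the refitted best estimator.
pred_via_grid = grid.predict(X_test)
pred_via_best = grid.best_estimator_.predict(X_test)
same_predictions = np.array_equal(pred_via_grid, pred_via_best)

# score() on the search object reports the best estimator's test accuracy.
test_accuracy = grid.score(X_test, y_test)
print(same_predictions, test_accuracy)
```

Pulling out best_estimator_ explicitly, as the post does, is still handy when you want to inspect or save the final model on its own.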