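I am using a Random Forest model and want to tune all of its important parameters as well as I can. To do that, I try every possible combination of values in nested loops and save the results. When the run finishes, I look through the results to see which setting is best.
Doing this on my own PC, the problem I hit is that my code crashes after running for 3 hours because it runs out of memory. So I have two questions:
- Is what I am doing a good approach (I am new to ML)? I mean going through every variant to find the best settings.
- Given my memory limits, can this be done on some website? I mean a free online compiler where I can upload my data file and have it compute the variants for me.
Anyway, my code is: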
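import numpy as np
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# X (features) and y (binary labels) are assumed to be loaded already.
random_states = [0, 42, 1000]
min_samples_leafs = np.linspace(0.1, 0.5, 5, endpoint=True)
min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
n_estimators = [1, 2, 4, 8, 16, 32, 64, 100, 200]
max_depths = np.linspace(1, 32, 32, endpoint=True)
train_results = []
test_results = []
temp_results = []
attempts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]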
for estimator in n_estimators:
    for max_depth in max_depths:
        for min_samples_split in min_samples_splits:
            for min_samples_leaf in min_samples_leafs:
                for random_state in random_states:
                    # Start a fresh list per configuration so the average below covers only its own attempts.
                    temp_results = []
                    for attempt in attempts:
                        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
                        rf = RandomForestClassifier(n_estimators=estimator, max_depth=int(max_depth), n_jobs=-1, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf)
                        rf.fit(X_train, y_train)
                        # Note: the score is computed on the training split; X_test and y_test are not used.
                        train_pred = rf.predict(X_train)
                        false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, train_pred)
                        roc_auc = auc(false_positive_rate, true_positive_rate)
                        temp_results.append({"estimator": estimator, "max_depth": max_depth, "sample_split": min_samples_split, "sample_leaf": min_samples_leaf, "random_state": random_state, "attempt": attempt, "result": roc_auc})
                        if attempt == attempts[-1]:
                            # Average the ROC AUC over the attempts of this configuration.
                            results = 0
                            for elem in temp_results:
                                results += float(elem["result"])
                            results = results / len(attempts)
                            test_results.append({"estimator": estimator, "max_depth": max_depth, "sample_split": min_samples_split, "sample_leaf": min_samples_leaf, "random_state": random_state, "attempt": attempt, "final_result": results})
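# Find the configuration with the highest average ROC AUC.
# best_score and entry replace the original names max and dict, which shadow Python built-ins.
result = []
best_score = 0
goat = 0
for entry in test_results:
    if entry["final_result"] > best_score:
        best_score = entry["final_result"]
        goat = entry
        result.append(entry)
print(datetime.now().strftime("%H:%M:%S"), "END ML")
print(result)
print(goat)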
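
For scale, the loops above cover 9 x 32 x 10 x 5 = 14,400 parameter combinations, and with 3 random states and 10 attempts each that comes to 432,000 model fits. One common alternative to looping over everything is scikit-learn's RandomizedSearchCV, which samples a fixed number of combinations from the same ranges and scores each one with cross-validation on held-out folds. The following is only a minimal sketch under assumptions, not a drop-in replacement: the data is synthetic (make_classification stands in for the real X and y, which the post does not show), and n_iter=20 and cv=5 are arbitrary illustrative values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV, train_test_split

# Assumption: synthetic binary data standing in for the real X and y.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Same parameter ranges as the nested loops in the question.
param_grid = {
    "n_estimators": [1, 2, 4, 8, 16, 32, 64, 100, 200],
    "max_depth": list(range(1, 33)),
    "min_samples_split": np.linspace(0.1, 1.0, 10).tolist(),
    "min_samples_leaf": np.linspace(0.1, 0.5, 5).tolist(),
}

# Size of the exhaustive grid, for comparison with n_iter below.
n_combinations = len(ParameterGrid(param_grid))
print("combinations in the full grid:", n_combinations)

# Sample 20 combinations (illustrative value) instead of trying all of them.
# Each one is scored by 5-fold cross-validated ROC AUC on held-out folds.
# Passing n_jobs=-1 to the search would run the fits in parallel.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=20,
    scoring="roc_auc",
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best cross-validated ROC AUC:", search.best_score_)

# Final check on data the search never saw.
test_auc = search.score(X_test, y_test)
print("ROC AUC on the held-out test split:", test_auc)
```

The search object keeps only one summary row per sampled combination (in search.cv_results_), and the number of fits is fixed up front at n_iter x cv plus one final refit, so run time can be budgeted before starting.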