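I have some data from work that is essentially confidential, so I can't share it here, but the dataset below illustrates the point well. Basically, I want to run a feature importance exercise to find the top independent features (in this case RM, LSTAT, and DIS) that have the greatest influence on the dependent feature (MEDV). Done! My question is: how do I use this model to find the IDs associated with the top independent features (RM, LSTAT, and DIS)?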
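After looking at the plot, is it just a matter of sorting the dataframe by RM, LSTAT, and DIS in descending order, since these are the features that influence the dependent feature the most? I don't think it works that way, but maybe it does. In this case, given my business needs, I am assuming that RM, LSTAT, and DIS are the "worst" features.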
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

# Loading the dataset
x = load_boston()
df = pd.DataFrame(x.data, columns=x.feature_names)
df["MEDV"] = x.target
X = df.drop(columns="MEDV")  # Feature matrix
y = df["MEDV"]               # Target variable
df.head()

# Add an id per distinct MEDV value, then sort by the target
df['id'] = df.groupby(['MEDV']).ngroup()
df = df.sort_values(by=['MEDV'], ascending=True)
df.head(10)

# Use the feature matrix columns so the names line up with the importances
# (df.columns would also include MEDV and id)
names = X.columns
reg = RandomForestRegressor()
reg.fit(X, y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), reg.feature_importances_), names), reverse=True))

# Horizontal bar chart of importances, smallest at the bottom
features = names
importances = reg.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='#8f63f4', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
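For reference, the sorting idea described in the question could be sketched roughly as follows. This is only an illustration of that idea, not a claim that it is the right way to use the model. The toy dataframe, the importance values, and the helper names `top_features` and `rank_ids` are all made up for the example; in the real script the importances would come from `reg.feature_importances_`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: column names follow the question,
# but every value here is made up for illustration.
df = pd.DataFrame({
    "id":    [0, 1, 2, 3, 4],
    "RM":    [6.5, 5.9, 7.2, 6.1, 6.8],
    "LSTAT": [4.9, 12.4, 3.1, 9.8, 5.3],
    "DIS":   [4.1, 2.3, 5.0, 3.2, 4.4],
    "CRIM":  [0.02, 0.9, 0.01, 0.3, 0.05],
})
feature_names = pd.Index(["RM", "LSTAT", "DIS", "CRIM"])
# Hypothetical importances, standing in for reg.feature_importances_.
importances = np.array([0.45, 0.35, 0.12, 0.08])


def top_features(names, scores, k=3):
    """Return the k feature names with the largest importance scores."""
    order = np.argsort(scores)[::-1][:k]
    return list(names[order])


def rank_ids(frame, columns, id_col="id"):
    """Sort rows by the given columns, all descending, and return the ids in that order."""
    ranked = frame.sort_values(by=columns, ascending=False)
    return ranked[id_col].tolist()


top = top_features(feature_names, importances)
ranked_ids = rank_ids(df, top)
print(top)
print(ranked_ids)
```

Two caveats apply to this sketch. Sorting by several columns is lexicographic, so LSTAT and DIS only break ties in RM. Also, impurity-based importances indicate how much the model relies on a feature, not whether high or low values push MEDV up or down, so whether "descending" means "worst" is a business assumption rather than something the model provides.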