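Some of my work data is confidential, so I can't share it here, but the dataset below illustrates the point well. Basically, I want to run a feature importance exercise to find the top independent features (here RM, LSTAT, and DIS) that have the greatest effect on the dependent variable (MEDV). That part is done! My question is: how do I use this model to find the IDs associated with the top independent features (RM, LSTAT, and DIS)?
After looking at the plot, is it just a matter of sorting the DataFrame in descending order by RM, LSTAT, and DIS, since these are the features that influence the dependent variable the most? I don't think it works that way, but maybe it does. In this case, given my business needs, I'm assuming RM, LSTAT, and DIS are the "worst" features.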
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
# Load the dataset
x = load_boston()
df = pd.DataFrame(x.data, columns = x.feature_names)
df["MEDV"] = x.target
X = df.drop(columns="MEDV")  # Feature matrix
y = df["MEDV"]  # Target variable
df.head()
df['id'] = df.groupby(['MEDV']).ngroup()
df = df.sort_values(by=['MEDV'], ascending=True)
df.head(10)
names = X.columns  # feature names only; df now also holds MEDV and id
reg = RandomForestRegressor()
reg.fit(X, y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), reg.feature_importances_), names), reverse=True))
features = names
importances = reg.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='#8f63f4', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
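
For reference, the sort described in the question could be sketched roughly as below. This is only an illustration under stated assumptions: the data is synthetic (made up here, not the Boston dataset), the `NOISE` column, the `demo` frame, and the `id` values are hypothetical, and only two top features are kept. Impurity-based importances say how much a feature contributes to the splits, not in which direction it moves MEDV, so whether a descending sort matches the business meaning of "worst" is a separate question this sketch does not settle.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (made up for illustration, not the Boston dataset).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "RM": rng.normal(6.0, 0.7, n),
    "LSTAT": rng.uniform(2.0, 35.0, n),
    "DIS": rng.uniform(1.0, 10.0, n),
    "NOISE": rng.normal(0.0, 1.0, n),  # hypothetical irrelevant feature
})
demo["MEDV"] = 5.0 * demo["RM"] - 0.5 * demo["LSTAT"] + rng.normal(0.0, 1.0, n)
demo["id"] = np.arange(n)  # hypothetical row identifier

feature_cols = ["RM", "LSTAT", "DIS", "NOISE"]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(demo[feature_cols], demo["MEDV"])

# Rank features by impurity-based importance and keep the top two names.
importances = pd.Series(model.feature_importances_, index=feature_cols)
top_features = importances.sort_values(ascending=False).head(2).index.tolist()

# The sort described in the question: order rows by the top features, descending.
ranked = demo.sort_values(by=top_features, ascending=False)
top_ids = ranked["id"].head(10).tolist()

print(top_features)
print(top_ids)
```

Keeping the feature names in a `pd.Series` indexed by column name avoids the index bookkeeping that `np.argsort` requires, and makes it harder to mislabel a bar when the DataFrame has extra columns such as `MEDV` or `id`.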