Author | Lavanya Shukla
Translator | Monanfei
Editor | 夕颜
Published by | AI科技大本营 (ID: rgznai100)
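Introduction: When you first get into data competitions, the sophisticated techniques on display can be intimidating. Experts from every field take part, the range of methods is dizzying, and a newcomer can feel tiny and helpless. This article shares the experience and techniques a competitor used to reach the top 0.3% in a Kaggle competition. Let's begin!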
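Overview of the Top 0.3% Model
The competition and its goal
Each row in the dataset describes the features of one house
Given these features, predict the sale price of each house
Models are evaluated by the RMSE (root mean squared error) between the logarithm of the predicted price and the logarithm of the actual price. Working on the log scale means the metric measures relative rather than absolute error, so mistakes on cheap houses and on expensive houses affect the score to roughly the same degree.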
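To make the metric concrete, here is a minimal sketch of RMSE on log prices. The helper name `log_rmse` and the example prices are made up for illustration; this is not the competition's scoring code.

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """RMSE between log(predicted) and log(actual) prices (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

def plain_rmse(y_true, y_pred):
    """Ordinary RMSE on raw prices, for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# A 10% over-prediction on a cheap house and on an expensive house (made-up prices).
cheap_log = log_rmse([100_000], [110_000])
expensive_log = log_rmse([1_000_000], [1_100_000])
cheap_raw = plain_rmse([100_000], [110_000])
expensive_raw = plain_rmse([1_000_000], [1_100_000])

print(cheap_log, expensive_log)   # equal: both are log(1.1)
print(cheap_raw, expensive_raw)   # the expensive house dominates on the raw scale
```

On the log scale both errors equal log(1.1), while on the raw scale the expensive house's error is ten times larger.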
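Key details of the model training process
Cross-validation: 12-fold cross-validation is used
Models: seven models are trained on each cross-validation run: six base regressors (ridge, SVR, gradient boosting, random forest, XGBoost, and LightGBM) and, presumably, the stacked regressor described next
Stacking: a StackingCVRegressor meta-learner is trained, with XGBoost as the meta-regressor
Blending: every trained model overfits to some degree, so the final predictions come from a blend of these models, which gives more robust results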
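The cross-validation and blending steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the author's pipeline: it uses synthetic data, only three scikit-learn models as stand-ins, arbitrary blend weights, and it leaves out the stacking step entirely.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): not the competition dataset.
X, y = make_regression(n_samples=240, n_features=10, noise=10.0, random_state=42)

# 12-fold cross-validation, as described in the text.
kf = KFold(n_splits=12, shuffle=True, random_state=42)

# Three stand-in models with small, arbitrary settings to keep the sketch fast.
models = {
    "ridge": Ridge(alpha=1.0),
    "gbr": GradientBoostingRegressor(n_estimators=50, random_state=42),
    "rf": RandomForestRegressor(n_estimators=50, random_state=42),
}

def cv_rmse(model):
    # scikit-learn reports negated MSE, so flip the sign before taking the root.
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kf)
    return np.sqrt(-neg_mse)

scores = {name: cv_rmse(model).mean() for name, model in models.items()}
print(scores)

# Blending: a weighted average of each model's predictions.
# The weights are illustrative only and must sum to 1.
weights = {"ridge": 0.5, "gbr": 0.3, "rf": 0.2}
fitted = {name: model.fit(X, y) for name, model in models.items()}

def blended_predict(X_new):
    return sum(weights[name] * fitted[name].predict(X_new) for name in weights)

blend = blended_predict(X)
```

Averaging works because the models tend to make different mistakes, so part of each model's individual error cancels out in the blend.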
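Model performance
As the figure below shows, the blended model performs best, with an RMSE of just 0.075. This blended model is used for the final predictions.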
In[1]:
from IPython.display import Image
Image("../input/kernel-files/model_training_advanced_regression.png")
Output[1]:
Now let's get started for real!
In[2]:
# Essentials
import numpy as np
import pandas as pd
import datetime
import random

# Plots
import seaborn as sns
import matplotlib.pyplot as plt

# Models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.svm import SVR
from mlxtend.regressor import StackingCVRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

# Misc
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA

pd.set_option('display.max_columns', None)

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")
pd.options.display.max_seq_items = 8000
pd.options.display.max_rows = 8000

import os
print(os.listdir("../input/kernel-fi
Output[2]:
['model_training_advanced_regression.png']
In[3]:
# Read in the dataset as a dataframe
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
train.shape, test.shape
Output[3]:
((1460, 81), (1459, 80))
EDA
The goal
Each row in the dataset describes the features of one house
Given these features, predict the sale price of each house
Visualizing the raw data
In[4]:
# Preview the data we're working with
train.head()
Output[4]:
SalePrice: exploring the target variable
In[5]:
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 7))
# Check the distribution of the target
# Note: distplot is deprecated in newer seaborn releases; histplot(..., kde=True) is its replacement
sns.distplot(train['SalePrice'], color="b");
ax.xaxis.grid(False)
ax.set(ylabel="Frequency")
ax.set(xlabel="SalePrice")
ax.set(title="SalePrice distribution")
sns.despine(trim=True, left=True)
plt.show()
In[6]:
# Skew and kurt
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
Skewness: 1.882876
Kurtosis: 6.536282
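A skewness well above zero, as reported here, points to a long right tail. One common response, in line with the log-scale metric described earlier, is a log transform of the target. The sketch below illustrates the effect on synthetic lognormal "prices", which are an assumption standing in for the real SalePrice column. It is not necessarily the step the author takes next.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed prices (assumption): lognormal draws, not the real data.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000), name="SalePrice")

raw_skew = prices.skew()
log_skew = np.log1p(prices).skew()  # log(1 + x) is safe even when x is 0

print("Skewness before log transform: %f" % raw_skew)
print("Skewness after log transform:  %f" % log_skew)
```

After the transform the distribution is close to symmetric, which tends to suit linear models such as ridge regression better.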
Available features: a closer look
Data visualization
In[7]:
# Finding numeric features
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []
for i in train.columns:
    if train[i].dtype in numeric_dtypes:
        if i in ['TotalSF', 'Total_Bathrooms', 'Total_porch_sf', 'haspool', 'hasgarage', 'hasbsmt', 'hasfireplace']:
            pass
        else:
            numeric.append(i)

# Visualising some more outliers in the data values
# One tall figure; each feature gets its own subplot cell in the loop below
fig = plt.figure(figsize=(12, 120))
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)
sns.color_palette("husl", 8)
for i, feature in enumerate(list(train[numeric]), 1):
    # Stop once MiscVal is reached; the remaining columns are not plotted
    if feature == 'MiscVal':
        break
    plt.subplot(len(list(numeric)), 3, i)
    sns.scatterplot(x=feature, y='SalePrice', hue='SalePrice', palette='Blues', data=train)
    plt.xlabel('{}'.format(feature), size=15, labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    plt.legend(loc='best', prop={'size': 10})
plt.show()
Exploring how these features correlate with each other and with SalePrice
In[8]:
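# Correlation matrix over the numeric columns only (string columns cannot be correlated)
corr = train.select_dtypes(include='number').corr()
plt.subplots(figsize=(15, 12))
sns.heatmap(corr, vmax=0.9, cmap="Blues", square=True)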
Output[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0e416e4e0>
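A heatmap is hard to read for exact rankings, so it can help to sort the correlations with the target directly. The sketch below does this on a tiny synthetic frame. The column names only mimic the real dataset and every value is made up, so treat it as an illustration of the technique rather than a result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in columns (assumption); values are generated, not real.
quality = rng.integers(1, 11, size=n)
area = rng.normal(1500, 400, size=n)
df = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": rng.normal(size=n),
    "Street": rng.choice(["Pave", "Grvl"], size=n),  # non-numeric, excluded below
    "SalePrice": 20_000 * quality + 50 * area + rng.normal(0, 10_000, size=n),
})

# Correlate numeric columns only, then rank features by correlation with the target.
corr = df.select_dtypes(include="number").corr()
ranking = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(ranking)
```

The top entries of such a ranking are natural candidates for the per-feature plots that follow.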
Selected features: visualizing their relationship with SalePrice
In[9]:
data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=train['OverallQual'], y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
In[10]:
data = pd.concat([train['SalePrice'], train['YearBuilt']], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=train['YearBuilt'], y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=45);