"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"# Let's perform analysis!\n",
"Hello, I'm Dixhom. Here I talk about how to preform feature engineering, de
lete unwanted variables, build a model and make submission data! So this is a tu
torial for data science beginners. So let's get the ball rolling.\n",
"(This is for a kaggle competition 'Kobe Bryant Shot Selection' (https://www
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"import numpy as np \n",
"import pandas as pd \n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.cross_validation import KFold"
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
"source": [
"# import data\n",
"filename= \"C:\\Users\\ajish\\PythonCoding\\KobeBryant.csv\"\n",
"raw = pd.read_csv(filename)"
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature engineering\n",
"Now let's start feature engineering. There are many features which should b
e modified or deleted for brevity. Let's take a look into variables.\n",
"First, let's take a look at all the variables."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"cell_type": "markdown",

"metadata": {},
"source": [
"## Dropping nans\n",
"We are gonna make a variable without `nan` for our exploratory analysis. "
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"nona = raw[pd.notnull(raw['shot_made_flag'])]"
"cell_type": "markdown",
"metadata": {},
"source": [
"## loc_x, loc_y, lat and lon\n",
"What do these mean? From their names, these sound like **location_x, locati
on_y, latitude and longitude**. Let's confirm this assumption. "
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"alpha = 0.02\n",
"# loc_x and loc_y\n",
"plt.scatter(nona.loc_x, nona.loc_y, color='blue', alpha=alpha)\n",
"plt.title('loc_x and loc_y')\n",
"# lat and lon\n",
"plt.scatter(nona.lon, nona.lat, color='green', alpha=alpha)\n",
"plt.title('lat and lon')"
"cell_type": "markdown",
"metadata": {},
"source": [
"These plot are shaped like basket ball courts. So loc_x, loc_y, lat and lon
seem to mean the position from which the ball was tossed. However, since the re
gion under the net is half-circle-shaped, it would be more suitable to transform
the variable into **polar coodinate**."
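
The polar transform described above can be sketched on a toy data frame like this. The helper name `add_polar_features` and the sample coordinates are made up for illustration; the rule for rows where `loc_x` is zero (set the angle to `np.pi / 2`) follows this tutorial.

```python
import numpy as np
import pandas as pd

def add_polar_features(df):
    """Return a copy of df with 'dist' and 'angle' computed from loc_x / loc_y."""
    out = df.copy()
    x = out['loc_x'].to_numpy(dtype=float)
    y = out['loc_y'].to_numpy(dtype=float)
    out['dist'] = np.sqrt(x ** 2 + y ** 2)
    # Start with pi / 2 everywhere, then overwrite the rows where the
    # division loc_y / loc_x is actually defined.
    angle = np.full(len(out), np.pi / 2)
    nonzero = x != 0
    angle[nonzero] = np.arctan(y[nonzero] / x[nonzero])
    out['angle'] = angle
    return out

# Hypothetical shot positions, not taken from the competition data.
toy = pd.DataFrame({'loc_x': [0, 3, -3], 'loc_y': [5, 4, 4]})
polar = add_polar_features(toy)
print(polar)
```

Working on NumPy arrays and assigning the finished column once avoids assigning into a slice of the data frame.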
"cell_type": "code",

"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"raw['dist'] = np.sqrt(raw['loc_x']**2 + raw['loc_y']**2)\n",
"loc_x_zero = raw['loc_x'] == 0\n",
"raw['angle'] = np.array([0]*len(raw))\n",
"raw['angle'][~loc_x_zero] = np.arctan(raw['loc_y'][~loc_x_zero] / raw['loc_
"raw['angle'][loc_x_zero] = np.pi / 2 "
"cell_type": "markdown",
"metadata": {},
"source": [
"Since some of loc_x values cause an error by zero-division, we set just `np
.pi / 2` to the corresponding rows.\n",
"## minutes_remaining and seconds_remaining\n",
"`minutes_remaining` and `seconds_remaining` seem to be a pair, so let's com
bine them together."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"raw['remaining_time'] = raw['minutes_remaining'] * 60 + raw['seconds_remain
"cell_type": "markdown",
"metadata": {},
"source": [
"## action_type, combined_shot_type, shot_type\n",
"These represents how the player shot a ball."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [

"cell_type": "markdown",
"metadata": {},
"source": [
"## Season\n",
"`Season` looks like consisting of two parts."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"`Season` seems to be composed of two parts: season year and season ID. Here
we only need season ID. Let's modify the data."
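
Here is a small sketch of that split on a few sample values. The three season strings are illustrative assumptions about the format (two parts joined by a hyphen), not rows copied from the data.

```python
import pandas as pd

# Hypothetical season labels in 'year-id' form.
seasons = pd.Series(['1996-97', '2000-01', '2015-16'])

# Keep only the part after the hyphen and turn it into an integer.
season_id = seasons.apply(lambda s: int(s.split('-')[1]))
print(season_id.tolist())
```

Note that `int('01')` gives `1`, so the leading zero is lost after the conversion.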
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"raw['season'] = raw['season'].apply(lambda x: int(x.split('-')[1]) )\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"## team_id and team_name\n",
"These contain the same one value for each. Seem useless. "
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [

"cell_type": "markdown",
"metadata": {},
"source": [
"## opponent , matchup\n",
"These are basically the same information. "
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"pd.DataFrame({'matchup':nona.matchup, 'opponent':nona.opponent})"
"cell_type": "markdown",
"metadata": {},
"source": [
"Only opponent is needed."
"cell_type": "markdown",
"metadata": {},
"source": [
"## Shot distance\n",
"We already defined this."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"plt.scatter(raw.dist, raw.shot_distance, color='blue')\n",
"plt.title('dist and shot_distance')"
"cell_type": "markdown",
"metadata": {},
"source": [
"`shot_distance` is proportional to `dist` and this won't be necessary.\n",
"## shot_zone_area, shot_zone_basic, shot_zone_range\n",
"These sound like some regions on the court, so let's visualize it."
"cell_type": "code",
"execution_count": null,

"metadata": {
"collapsed": false
"outputs": [],
"source": [
"import matplotlib.cm as cm\n",
"def scatter_plot_by_category(feat):\n",
"    alpha = 0.1\n",
"    gs = nona.groupby(feat)\n",
"    cs = cm.rainbow(np.linspace(0, 1, len(gs)))\n",
"    for g, c in zip(gs, cs):\n",
"        plt.scatter(g[1].loc_x, g[1].loc_y, color=c, alpha=alpha)\n",
"# shot_zone_area\n",
"# shot_zone_basic\n",
"# shot_zone_range\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"As we thought, these represent regions on the court. However, these regions
can be separated by `dist` and `angle`. So we don't need these."
"cell_type": "markdown",
"metadata": {},
"source": [
"## dropping unneeded variables\n",
"Let's drop unnecessary variables."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_ra
nge', 'shot_zone_basic', \\\n",
'matchup', 'lon', 'lat', 'seconds_remaining', 'minutes_remaining',
'shot_distance', 'loc_x', 'loc_y', 'game_event_id', 'game_id', 'ga

"for drop in drops:\n",
raw = raw.drop(drop, 1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"## make dummy variables\n",
"We are going to use randomForest classifier for building our models but thi
s doesn't accept string variables like 'action_type'. So we are going to make du
mmy variables for those."
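
To see what dummy variables look like, here is a minimal sketch on a toy data frame. The column values are made up for illustration; `pd.get_dummies` and `pd.concat` are the same pandas functions this tutorial uses.

```python
import pandas as pd

# Hypothetical rows: one string column and one numeric column.
toy = pd.DataFrame({
    'shot_type': ['2PT Field Goal', '3PT Field Goal', '2PT Field Goal'],
    'dist': [1.0, 25.0, 10.0],
})

# One indicator column per category, named '<prefix>_<category>'.
dummies = pd.get_dummies(toy['shot_type'], prefix='shot_type')

# Replace the string column with its indicator columns.
encoded = pd.concat([toy.drop('shot_type', axis=1), dummies], axis=1)
print(encoded)
```

Every category becomes its own column, so a variable with many categories, such as 'action_type', adds many columns.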
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"# turn categorical variables into dummy variables\n",
"categorical_vars = ['action_type', 'combined_shot_type', 'shot_type', 'oppo
nent', 'period', 'season']\n",
"for var in categorical_vars:\n",
raw = pd.concat([raw, pd.get_dummies(raw[var], prefix=var)], 1)\n",
raw = raw.drop(var, 1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"## separating data for training and submission\n",
"Now let's separate data."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"df = raw[pd.notnull(raw['shot_made_flag'])]\n",
"submission = raw[pd.isnull(raw['shot_made_flag'])]\n",
"submission = submission.drop('shot_made_flag', 1)"
"cell_type": "markdown",
"metadata": {},
"source": [
"We are separating `df` further into explanatory and response variables."

"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"# separate df into explanatory and response variables\n",
"train = df.drop('shot_made_flag', 1)\n",
"train_y = df['shot_made_flag']"
"cell_type": "markdown",
"metadata": {},
"source": [
"## logloss\n",
"Submissions are evaluated on the log loss. We are going to use it for evalu
ating our model."
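
For reference, binary log loss can be sketched with plain NumPy as below. This is an illustrative version, not necessarily the exact function used in this notebook; the name `logloss_sketch` and the sample numbers are made up.

```python
import numpy as np

def logloss_sketch(act, pred, epsilon=1e-15):
    """Mean binary log loss; pred holds predicted probabilities of class 1."""
    act = np.asarray(act, dtype=float)
    # Clip so that log(0) never happens for predictions of exactly 0 or 1.
    pred = np.clip(np.asarray(pred, dtype=float), epsilon, 1 - epsilon)
    return -np.mean(act * np.log(pred) + (1 - act) * np.log(1 - pred))

confident = logloss_sketch([1, 0], [0.9, 0.1])   # good probabilities
hard_miss = logloss_sketch([1, 0], [0.0, 1.0])   # hard 0/1 labels, both wrong
print(confident, hard_miss)
```

The second call shows why probabilities matter: a hard 0/1 prediction that turns out wrong is punished very heavily, so it is safer to score a model on predicted probabilities than on predicted labels.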
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"import scipy as sp\n",
"def logloss(act, pred):\n",
"    epsilon = 1e-15\n",
"    pred = sp.maximum(epsilon, pred)\n",
"    pred = sp.minimum(1-epsilon, pred)\n",
ll = sum(act*sp.log(pred) + sp.subtract(1,act)*sp.log(sp.subtract(1,pre
"    ll = ll * -1.0/len(act)\n",
"    return ll"
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building a model\n",
"Now it's time to build a model. We use randomForest classifier and k-fold c
ross validation for testing our model.\n",
"We are going to...\n",
"1. pick a `n` from `n_range` for the number of estimators in randomForestCl
"1. divide the training data into 10 pieces\n",
"2. pick 9 of them for building a model and use the remaining 1 for testing
a model\n",
"3. repeat the same process for the other 9 pieces.\n",
"4. calculate score for each and take an average of them\n",
"5. pick the next `n` and do the process again\n",
"6. find the `n` which gave the best score among `n_range`\n",
"7. repeat the same process with the tree depth parameter.\n",

"You can change the value of `np.logspace` for searching optimum value in br
oader area."
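
The fold-by-fold part of the steps above can be sketched with NumPy only. Everything here is an illustrative assumption: the helper names, the toy labels, and especially the stand-in "model", which just predicts the make rate of the training folds instead of fitting a random forest. In practice scikit-learn's `KFold` produces the index pairs for you.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross validation."""
    order = np.random.RandomState(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        # One piece is held out for testing, the other pieces are for training.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, folds[i]

def binary_logloss(act, pred, epsilon=1e-15):
    pred = np.clip(pred, epsilon, 1 - epsilon)
    return -np.mean(act * np.log(pred) + (1 - act) * np.log(1 - pred))

y = np.array([0, 1] * 50)  # toy labels: 100 shots, half of them made

fold_scores = []
for train_idx, test_idx in kfold_indices(len(y)):
    p = y[train_idx].mean()               # stand-in model, "fit" on 9 pieces
    pred = np.full(len(test_idx), p)      # same probability for every test row
    fold_scores.append(binary_logloss(y[test_idx], pred))

cv_score = float(np.mean(fold_scores))    # average over the 10 pieces
print(cv_score)
```

Repeating this loop once per candidate parameter value and keeping the value with the lowest `cv_score` gives the search described in the list.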
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"from sklearn.metrics import confusion_matrix\n",
"import time\n",
"# find the best n_estimators for RandomForestClassifier\n",
"print('Finding best n_estimators for RandomForestClassifier...')\n",
"min_score = 100000\n",
"best_n = 0\n",
"scores_n = []\n",
"range_n = np.logspace(0,2,num=3).astype(int)\n",
"for n in range_n:\n",
print(\"the number of trees : {0}\".format(n))\n",
t1 = time.time()\n",
rfc_score = 0.\n",
rfc = RandomForestClassifier(n_estimators=n)\n",
for train_k, test_k in KFold(len(train), n_folds=10, shuffle=True):\n",
rfc.fit(train.iloc[train_k], train_y.iloc[train_k])\n",
#rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/1
pred = rfc.predict(train.iloc[test_k])\n",
rfc_score += logloss(train_y.iloc[test_k], pred) / 10\n",
if rfc_score < min_score:\n",
min_score = rfc_score\n",
best_n = n\n",
t2 = time.time()\n",
print('Done processing {0} trees ({1:.3f}sec)'.format(n, t2-t1))\n",
"print(best_n, min_score)\n",
"# find best max_depth for RandomForestClassifier\n",
"print('Finding best max_depth for RandomForestClassifier...')\n",
"min_score = 100000\n",
"best_m = 0\n",
"scores_m = []\n",
"range_m = np.logspace(0,2,num=3).astype(int)\n",
"for m in range_m:\n",
print(\"the max depth : {0}\".format(m))\n",
t1 = time.time()\n",
rfc_score = 0.\n",
rfc = RandomForestClassifier(max_depth=m, n_estimators=best_n)\n",
for train_k, test_k in KFold(len(train), n_folds=10, shuffle=True):\n",
rfc.fit(train.iloc[train_k], train_y.iloc[train_k])\n",
#rfc_score += rfc.score(train.iloc[test_k], train_y.iloc[test_k])/1

pred = rfc.predict(train.iloc[test_k])\n",
rfc_score += logloss(train_y.iloc[test_k], pred) / 10\n",
if rfc_score < min_score:\n",
min_score = rfc_score\n",
best_m = m\n",
t2 = time.time()\n",
print('Done processing {0} trees ({1:.3f}sec)'.format(m, t2-t1))\n",
"print(best_m, min_score)\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizing parameters for randomForest\n",
"By visualizing the parameters, we can check if the chosen parameter is real
ly the best."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"plt.plot(range_n, scores_n)\n",
"plt.xlabel('number of trees')\n",
"plt.plot(range_m, scores_m)\n",
"plt.xlabel('max depth')"
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building a final model\n",
"Let's use the parameters we just got for the final model and prediction."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"model = RandomForestClassifier(n_estimators=best_n, max_depth=best_m)\n",
"model.fit(train, train_y)\n",

"pred = model.predict_proba(submission)"
"cell_type": "markdown",
"metadata": {},
"source": [
"# Making submission data\n",
"Predicted shot_made_flag is written to a csv file."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
"outputs": [],
"source": [
"sub = pd.read_csv(\"../input/sample_submission.csv\")\n",
"sub['shot_made_flag'] = pred\n",
"sub.to_csv(\"real_submission.csv\", index=False)"
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
"nbformat": 4,
"nbformat_minor": 0

