Let's see how we can perform linear classification using the TensorFlow library in Python. We will use the LinearClassifier class from the TensorFlow Estimator API. We will use US census income data (the Adult dataset) and try to predict which income class (>50K or <=50K) each person belongs to. You can download this dataset from here. This dataset has 32561 observations and 15 features. You can also download my Jupyter notebook containing the code below from here. So, let's get started.
Step 1: Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Step 2: Load and explore the dataset
dataset = pd.read_csv('adult.csv')
dataset.head()
dataset.size
dataset.shape
dataset.columns
dataset.dtypes
dataset.describe()
Step 3: Drop fnlwgt column
We are not going to use this column, as it does not seem to contribute any relevant information to our prediction, so we drop it.
dataset.drop('fnlwgt', axis=1, inplace=True)
Step 4: Convert label into 0 and 1
dataset['income'].unique()
Output: array(['<=50K', '>50K'], dtype=object)
This means we have only two string labels. Let's convert them into numeric labels (0 and 1).
def label_fix(label):
    if label == '<=50K':
        return 0
    else:
        return 1
dataset['income'] = dataset['income'].apply(label_fix)
dataset.head()
dataset['income'].unique()
dataset['income'].value_counts()
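Note that the else branch maps every value other than '<=50K' to 1, so a label with stray whitespace or trailing punctuation would silently be counted as the high-income class. A stricter variant is sketched below. It assumes the only valid labels are the two shown above; LABEL_TO_ID and label_to_id are names introduced here for illustration and are not part of the notebook.

```python
# Assumption: these are the only two valid labels in the income column.
LABEL_TO_ID = {"<=50K": 0, ">50K": 1}


def label_to_id(label):
    """Convert an income label to 0/1, failing loudly on anything unexpected."""
    cleaned = label.strip()  # tolerate leading/trailing whitespace
    if cleaned not in LABEL_TO_ID:
        raise ValueError(f"unexpected income label: {label!r}")
    return LABEL_TO_ID[cleaned]


converted = [label_to_id(value) for value in ["<=50K", " >50K", "<=50K "]]
```

It could be used in place of label_fix in the apply call, for example dataset['income'].apply(label_to_id).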
Step 5: Split dataset into training and testing set
X = dataset.drop('income', axis=1)
y = dataset['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
Step 6: Create Feature Columns
The estimator does not consume the raw DataFrame columns directly. It needs a list of feature columns that describes how each independent variable should be converted into a tensor.
We need to create feature columns for our numeric and categorical data. Feature columns act as the intermediaries between raw data and TensorFlow Estimators.
Convert numeric columns into feature columns.
tf.feature_column.numeric_column: Use this to convert a numeric column into a feature column.
Convert categorical columns into feature columns.
tf.feature_column.categorical_column_with_hash_bucket: Use this if you don't know the set of possible values for a categorical column in advance and there are too many of them.
tf.feature_column.categorical_column_with_vocabulary_list: Use this if you know the set of all possible feature values of a column and there are only a few of them.
So, let's convert all our columns into feature columns as discussed above.
workclass = tf.feature_column.categorical_column_with_hash_bucket('workclass', hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket('education', hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket('marital_status', hash_bucket_size=1000)
occupation = tf.feature_column.categorical_column_with_hash_bucket('occupation', hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket('relationship', hash_bucket_size=1000)
race = tf.feature_column.categorical_column_with_hash_bucket('race', hash_bucket_size=1000)
sex = tf.feature_column.categorical_column_with_vocabulary_list('sex', ['Female', 'Male'])
native_country = tf.feature_column.categorical_column_with_hash_bucket('native_country', hash_bucket_size=1000)
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')
feature_columns = [workclass, education, marital_status, occupation, relationship, race, sex, native_country, age, education_num, capital_gain, capital_loss, hours_per_week]
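To make the difference between the two kinds of categorical column more concrete, here is a rough pure-Python sketch of the idea behind each. It is an illustration only: hash_bucket_id and vocabulary_id are hypothetical helpers, MD5 merely stands in for a stable hash (TensorFlow uses its own fingerprint function internally), the occupation strings are sample values, and the -1 returned for unknown values is an assumption mirroring what I understand to be the default for out-of-vocabulary values.

```python
import hashlib


def hash_bucket_id(value, hash_bucket_size):
    """Map any string to a bucket id in [0, hash_bucket_size).

    Assumption: MD5 is used here only to get a stable hash for illustration.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def vocabulary_id(value, vocabulary, default_value=-1):
    """Return the position of value in a known vocabulary, else default_value."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return default_value


# Sample category strings, chosen for illustration.
occupations = ["Sales", "Tech-support", "Exec-managerial", "Craft-repair"]

# No vocabulary is needed up front: every string lands in some bucket.
bucket_ids = {name: hash_bucket_id(name, 1000) for name in occupations}

# With too few buckets, different categories are forced to share an id.
tiny_bucket_ids = {name: hash_bucket_id(name, 2) for name in occupations}

# With a vocabulary list, each known value gets its own id.
sex_ids = [vocabulary_id(v, ["Female", "Male"]) for v in ["Female", "Male", "Unknown"]]
```

The trade-off is that hashing never fails on unseen values but can merge distinct categories into one bucket, while a vocabulary list keeps categories separate but has to be known in advance.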
Step 7: Create Input Function
We now create an input function that feeds the Pandas DataFrame into our classifier model. It requires you to specify the features, labels and batch size. It also has an argument called shuffle, which makes the model read the records in random order; this usually helps training. You can also specify the number of epochs you want to use.
input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=128, num_epochs=None, shuffle=True)
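I have set the batch size to 128 and the number of epochs to None, which repeats the data indefinitely, so the length of training is controlled by the steps argument passed to train. By default, the number of epochs is 1.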
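The roles of batch_size, num_epochs and shuffle can be sketched with a small pure-Python generator. This is a simplified illustration, not how pandas_input_fn is implemented: batch_input_fn is a hypothetical name, the rows are toy data, and details such as how the real function treats the last partial batch may differ.

```python
import itertools
import random


def batch_input_fn(rows, labels, batch_size, num_epochs=1, shuffle=True, seed=0):
    """Yield (features, labels) batches; num_epochs=None repeats forever."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        if shuffle:
            rng.shuffle(indices)  # new random order for every pass over the data
        for start in range(0, len(indices), batch_size):
            chunk = indices[start:start + batch_size]
            yield [rows[i] for i in chunk], [labels[i] for i in chunk]


# Toy data standing in for X_train and y_train.
rows = [{"age": 20 + i} for i in range(10)]
labels = [i % 2 for i in range(10)]

# One epoch: every row is seen exactly once, then the generator stops.
one_epoch = list(batch_input_fn(rows, labels, batch_size=4, num_epochs=1))
batch_sizes = [len(features) for features, _ in one_epoch]

# num_epochs=None: the generator never stops, so the caller decides how many
# batches (training steps) to take.
endless = batch_input_fn(rows, labels, batch_size=4, num_epochs=None)
first_five = list(itertools.islice(endless, 5))
```

With num_epochs=None the data source never runs out, which is why the number of training steps has to be given separately.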
Step 8: Create a model using feature columns and input function
model = tf.estimator.LinearClassifier(feature_columns = feature_columns)
model.train(input_fn = input_fn, steps=1000)
This lets the optimizer perform 1000 training steps.
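To get a feel for how much data 1000 steps covers, the arithmetic can be worked out as below. This is a rough sketch: it assumes no rows were dropped from the 32561 observations and that train_test_split rounds the test set up to a whole row, which is an assumption about scikit-learn's rounding rather than something taken from the notebook.

```python
import math

n_rows = 32561      # observations in the dataset
test_size = 0.3     # fraction held out in Step 5
batch_size = 128    # rows per training step
steps = 1000        # training steps

# Assumption: the test set size is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

# Each step consumes one batch, so this is the total number of rows processed.
examples_seen = batch_size * steps

# How many full passes over the training set that corresponds to.
epochs_equivalent = examples_seen / n_train

print(n_train, n_test, examples_seen, round(epochs_equivalent, 2))
```

Under these assumptions, 1000 steps correspond to roughly five and a half passes over the training set.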
Step 9: Make predictions
pred_fn = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size=len(X_test), shuffle=False)
predictions = list(model.predict(input_fn = pred_fn))
predictions[0]
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])
final_preds[:10]
df = pd.DataFrame({'Actual': y_test, 'Predicted': final_preds})
df
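Each element of predictions is a dictionary, and the loop above reads its class_ids entry. As a rough illustration of where such a class id comes from in a binary linear classifier, the sketch below computes a logit as a weighted sum of features, turns it into a probability with the sigmoid function, and picks class 1 when the logit is positive. The weights, bias and example row are made up, the helper names are hypothetical, and the dictionary only imitates a few of the keys; it is not the Estimator's actual output.

```python
import math


def linear_logit(weights, bias, features):
    """Weighted sum of the feature values plus a bias term."""
    return bias + sum(weights[name] * value for name, value in features.items())


def binary_prediction(logit):
    """Build a prediction dictionary for one example from a single logit."""
    probability_of_1 = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return {
        "logits": [logit],
        "probabilities": [1.0 - probability_of_1, probability_of_1],
        "class_ids": [1 if logit > 0 else 0],
    }


# Made-up parameters and input, for illustration only.
weights = {"age": 0.02, "hours_per_week": 0.03}
bias = -2.0
example = {"age": 45, "hours_per_week": 50}

pred = binary_prediction(linear_logit(weights, bias, example))
predicted_class = pred["class_ids"][0]
```

A positive logit corresponds to a probability above 0.5 for class 1, which is why the sign of the logit decides the class.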
Step 10: Check accuracy
print(classification_report(y_test, final_preds))
print(confusion_matrix(y_test, final_preds))
print(accuracy_score(y_test, final_preds))
We got around 82.5% accuracy. You can play around with hyperparameters such as the number of epochs, number of steps and batch size to try to improve the accuracy.
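Accuracy is easier to judge next to a baseline: always predicting the most common class already scores the share of that class (the value_counts() output in Step 4 shows how the two classes are split). The sketch below computes the confusion-matrix counts, the accuracy and that majority-class baseline by hand on a small made-up example. The labels are toy data, not results from the model, and confusion_counts is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Toy labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)

# Score obtained by always predicting the most frequent class.
majority_count = max(y_true.count(0), y_true.count(1))
majority_baseline = majority_count / len(y_true)

print(tn, fp, fn, tp, accuracy, majority_baseline)
```

If the model's accuracy is only slightly above the majority-class baseline, the per-class precision and recall in the classification report are usually more informative than the accuracy figure alone.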