OKCupid Date-A-Scientist
This project utilises exploratory data analysis (EDA) and supervised machine learning algorithms in an attempt to predict the zodiac sign of OkCupid users.
- Introduction
- Importing the data
- Exploring the data
- Continuous variables
- Discrete variables
- Data preparation
- Models
- NLP and Naive Bayes
- Evaluation
Introduction
In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible romantic matches and optimize the user experience. These apps give us access to a wealth of information that we’ve never had before about how different people experience romance.
Scope
This project utilises information from the dating platform OkCupid 2 in order to answer specific questions.
Project Goals
There are two primary goals for this project. The first is to perform EDA (scoping, initial analysis and investigation) on the data, which will then be cleaned and prepared for use in several supervised machine learning algorithms. The second goal is to use this data to see if it is possible to determine the zodiac sign of an OkCupid member, using the information provided in their profile. Zodiac signs have been identified as an important attribute in the dating world, and as a portion of profiles do not provide this information, it would be useful for OkCupid to predict missing zodiac signs, in order to increase the likelihood of successful dating matches.
Data
The data is included within profiles.csv, which has been provided by Codecademy 3. The data contains 59946 rows (each representing an individual OkCupid user) and 31 columns. The columns contain information about the age, body type, diet, alcohol and drug consumption, education level, job and income, ethnicity, height, children and pets, sexual orientation, sex, zodiac sign and spoken languages as well as answers to several multiple choice or short-essay style questions.
Analysis
Visualisation and descriptive statistical methods will be used to understand the data, before building and applying four supervised machine learning classification algorithms:
- Logistic regression
- K-nearest neighbors
- Random forest
- NLP/Naive Bayes
Results and Evaluation
The results of the machine learning algorithms will be reported and their success levels evaluated. Following this, recommendations for improved results and next steps will be discussed.
2. OkCupid
3. Codecademy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import re
%matplotlib inline
okc = pd.read_csv(r'profiles.csv')
(okc.head(5))
#print(okc.dtypes)
(okc.info())
print(f"Number of rows = {len(okc.age)}")
The data provided has the following columns, which provide multiple choice answers:
- body_type - categorical variable
- diet - categorical variable
- drinks - categorical variable
- drugs - categorical variable
- education - categorical variable
- ethnicity - categorical variable
- height - continuous variable
- income - continuous variable
- job - categorical variable
- last_online - date variable
- offspring - categorical variable
- orientation - categorical variable
- pets - categorical variable
- religion - categorical variable
- sex - categorical variable
- sign - categorical variable
- smokes - categorical variable
- speaks - categorical variable
- status - categorical variable
And a set of open short-answer responses to:
- essay0 - My self-summary
- essay1 - What I’m doing with my life
- essay2 - I’m really good at…
- essay3 - The first thing people usually notice about me…
- essay4 - Favorite books, movies, show, music, and food
- essay5 - The six things I could never do without
- essay6 - I spend a lot of time thinking about…
- essay7 - On a typical Friday night I am…
- essay8 - The most private thing I am willing to admit
- essay9 - You should message me if…
Changing the essay question columns into easier to understand variable names
okc.rename(columns={'essay0' : 'self_summary', 'essay1' : 'life_plans', 'essay2': 'good_at', 'essay3' : 'notice_about', 'essay4' : 'favourites', 'essay5' : 'do_without', 'essay6' : 'think_about', 'essay7' : 'friday', 'essay8' : 'private', 'essay9' : 'message'}, inplace=True)
(okc.columns)
An initial brief look at the numerical data within the dataframe
(okc.describe())
okc.sign.nunique()
There appear to be 48 distinct answers to the 'What is your zodiac sign?' question. As there are only 12 possible signs, this data must be looked at more closely.
okc.sign.unique()
The signs are qualified by how important zodiac signs are to each OkCupid user. Whilst this is interesting information, initially it is best to remove this data and place the cleaned zodiac sign in a new column. Creating a new column means the sign importance data is retained for future use, if required.
okc['sign_clean']= okc.sign.str.split(' ').str[0]
okc.sign_clean.nunique()
sign_labels = list(okc.sign_clean.unique())
sign_labels_nonull = [item for item in sign_labels if not pd.isnull(item)]
print(sign_labels_nonull)
sign_labels_plt = [x.title() for x in sign_labels_nonull]
print(sign_labels_plt)
The 12 unique zodiac signs are now correctly labelled in the okc.sign_clean column and the spread is represented in the plot below. The data appears fairly balanced, with Capricorn slightly under-represented among users.
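As a minimal sketch of the cleaning step, the snippet below applies the same str.split(' ').str[0] pattern to a small hand-made Series. The example strings and the importance variable are illustrative assumptions, not values taken from profiles.csv.

```python
import pandas as pd

# Hypothetical raw answers in the style of the `sign` column (made up for illustration)
raw_sign = pd.Series([
    "leo but it doesn't matter",
    "gemini",
    "pisces and it's fun to think about",
    None,
])

# First word = the zodiac sign itself; missing answers stay missing
sign_clean = raw_sign.str.split(' ').str[0]

# Everything after the first space = the importance qualifier (missing if absent)
importance = raw_sign.str.split(' ', n=1).str[1]

print(pd.DataFrame({'sign_clean': sign_clean, 'importance': importance}))
```

Splitting with n=1 keeps the qualifier in one piece, which is one way the importance information could be retained alongside the cleaned sign.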
sns.set(style = 'ticks')
sns.set_context("notebook", font_scale=1.2, rc={"lines.linewidth": 2.5})
f, ax = plt.subplots(figsize=(15,8))
sns.countplot(data=okc, y='sign_clean')
sns.color_palette("Spectral", as_cmap=True)
ax.set_ylabel('')
ax.set_xlabel('Count')
ax.set_title('Proportion of zodiac signs in the OkCupid data')
ax.set_yticklabels(sign_labels_plt)
plt.show()
sns.set_style('ticks')
#sns.set_context('notebook')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
f, (ax1, ax2) = plt.subplots(1,2, sharey=False, figsize=(12,6))
plt.subplots_adjust(wspace=0.5)
ax1 = sns.histplot(okc.age, color='purple', ax=ax1)
ax1.set_title('Age distribution of OkCupid profiles')
ax1.set_xlabel('Age')
ax2 = sns.histplot(okc.age[okc.age > 50], color='pink', ax=ax2)
ax2.set_title('Age distribution of profiles above 50 years old')
ax2.set_xlabel('Age')
plt.show()
print(f'Mean age: {round(okc.age.mean(), 2)} ')
print(f'90th percentile: {okc.age.quantile(0.9)}')
print(f'Maximum age: {okc.age.max()} ')
print(f'Minimum age: {okc.age.min()} ')
The mean age of users is 32 years old, with 90% of the data lying below 46 years. The maximum age is 110, which is either impressive, a sign that someone didn't want to disclose their age, or an error. These outliers cause the already right-skewed data to be further skewed. Therefore, the two outliers, at 109 and 110, are best removed in order to minimise their effect on the total distribution of the data, and thus the statistical analysis. The filter below keeps only ages under 75.
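The effect of a couple of extreme values on skewness can be sketched with a small made-up sample. The ages below are hypothetical and only loosely resemble the real column. A positive skewness value indicates a longer right tail.

```python
import pandas as pd

# Made-up ages with two implausible entries at the top end
ages = pd.Series([22, 24, 25, 27, 28, 29, 30, 31, 33, 35, 38, 42, 47, 55, 109, 110])

skew_all = ages.skew()            # sample skewness including the outliers
trimmed = ages[ages < 75]         # same style of filter as used in this notebook
skew_trimmed = trimmed.skew()     # sample skewness after removing them

print(f'Skewness with outliers:    {skew_all:.2f}')
print(f'Skewness without outliers: {skew_trimmed:.2f}')
print(f'Rows removed: {len(ages) - len(trimmed)}')
```

In this toy sample the skewness stays positive after trimming but drops noticeably, which is the behaviour the outlier removal is aiming for.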
okc = okc[okc.age < 75]
sns.set_style('ticks')
#sns.set_context('notebook')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
sns.displot(data=okc, x="age", hue="sex", multiple = "stack", palette=['blue', 'red'], kind='hist', binwidth=2)
plt.title('Age distribution of OkCupid profiles of both sexes')
plt.xlabel('Age')
plt.show()
The plot shows that the age distribution for males and females is very similar - but also indicates that more males than females use OkCupid.
print(okc.height.value_counts())
fig, (ax1) = plt.subplots(figsize=(7, 6))
sns.set_style('ticks')
#sns.set_context('notebook')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
sns.histplot(data=okc, x="height", hue="sex", binwidth=2, multiple = "stack", palette=['blue', 'red'], ax=ax1)
ax1.set_xlim(55, 85)
plt.title('Height distribution of OkCupid users')
plt.xlabel('Height')
plt.show()
print(f'Minimum female height: {okc.height[okc.sex == "f"].min()} inches')
print(f'Minimum male height: {okc.height[okc.sex == "m"].min()} inches')
print(f'Maximum female height: {okc.height[okc.sex == "f"].max()} inches')
print(f'Maximum male height: {okc.height[okc.sex == "m"].max()} inches')
print(f'Average female height: {round(okc.height[okc.sex == "f"].mean(),2)} inches')
print(f'Average male height: {round(okc.height[okc.sex == "m"].mean(),2)} inches')
The minimum heights listed are 4 inches and 1 inch for females and males, respectively. We can assume that these users did not want to list their real height in their profiles. As heights for both males and females are approximately normally distributed, the small values will not be removed - perhaps a certain zodiac sign prefers not to disclose their height. The average height for females is 65 inches, or 5ft 5in, and the average height for males is 5ft 10.5in.
sns.displot(okc.income, color='green')
sns.set_style('ticks')
#sns.set_context('notebook')
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
plt.title('Distribution of income for OkCupid profiles')
plt.xlabel('Income')
plt.show()
okc_no_income = (okc.income == -1).value_counts()
# The mean of a boolean mask is the fraction of True values; this avoids integer indexing on a boolean index
percent_no_income = (okc.income == -1).mean() * 100
print(okc_no_income)
print(percent_no_income)
It appears that most people, 81 %, prefer not to disclose their income in their dating profile. This could be for several reasons, such as it is often seen as crass to discuss income, or that people do not want money to be a factor in choosing a date. Given this lack of information, income will not be considered when applying models to the data.
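The following sketch shows why a placeholder such as -1 is better treated as missing before any statistics are computed. The incomes are hypothetical, and reading -1 as "not disclosed" is the same assumption made in the analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; -1 is assumed to mean "not disclosed"
income = pd.Series([-1, -1, 50000, -1, -1, 80000, -1, -1, -1, -1])

share_missing = (income == -1).mean() * 100     # fraction of True values in the mask
income_nan = income.replace(-1, np.nan)         # make the gap explicit

print(f'Not disclosed: {share_missing:.0f}%')
print(f'Mean with placeholder kept: {income.mean():.0f}')
print(f'Mean of disclosed incomes:  {income_nan.mean():.0f}')
```

Keeping the placeholder drags the mean towards zero, whereas the NaN version averages only the values that were actually reported.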
Discrete variables
Now we have looked at the continuous variables above, the next section will discuss the discrete variables, which make up the majority of the data.
Sex
The sex distribution of the OkCupid profiles shows that there is a larger proportion of males than females. The data does not include any trans or non-binary labels, which could indicate that the profiles in the data did not include these groups, that only binary options were available, or that only sex at birth was considered.
plt.pie(okc.sex.value_counts(), labels=['Male', 'Female'], colors=['blue', 'red'], autopct='%0.1f%%', explode=[0.02]*2)
plt.axis('equal')
plt.title('Gender distribution of OkCupid profiles')
plt.show()
def plotting_tool(df, x, width, height):
    # Draw one horizontal count plot per column name in x, side by side
    sns.set_style('ticks')
    sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
    sns.color_palette("Spectral", as_cmap=True)
    plt.figure(figsize=[width, height])
    plt.subplots_adjust(wspace=0.5, hspace=0.3)
    for i in range(len(x)):
        plt.subplot(1, len(x), i+1)
        sns.countplot(data=df, y=x[i])
        sns.color_palette("Spectral", as_cmap=True)
        plt.title(f'{x[i]}')
        plt.ylabel('')
        plt.xlabel('Count')
okc['diet_clean'] = okc.diet.str.split(' ').str[-1]
#print(okc.diet_clean)
diet_body = ['diet', 'diet_clean', 'body_type']
plotting_tool(okc, diet_body, 20, 10)
f, ax=plt.subplots(figsize=(15,6))
sns.color_palette("Spectral", as_cmap=True)
ax = sns.countplot(data=okc, x = 'body_type', hue='diet_clean')
ax.set_title('Distribution of dietary choices among different body types')
plt.xticks(rotation=20)
plt.xlabel('Body type')
plt.ylabel('Count')
plt.show()
The majority of users are omnivores, with the next most prevalent choice being vegetarian. This is reflected in the relationship between diet and body type, with the majority dietary choice of each body type being omnivore and the second being vegetarian.
lifestyle_choices = ['smokes', 'drinks', 'drugs']
plotting_tool(okc, lifestyle_choices, 20, 6)
# Share of users giving each answer; the mean of a boolean mask is the fraction of True values
smoke_percent = round((okc.smokes == 'no').mean() * 100, 2)
drinker_percent = round((okc.drinks == 'socially').mean() * 100, 2)
drugs_percent = round((okc.drugs == 'never').mean() * 100, 2)
print(f'Percentage of users that drink socially {drinker_percent}%')
print(f'Percentage of users that never use drugs {drugs_percent}%')
print(f'Percentage of users that do not smoke {smoke_percent}%')
The data shows that the majority of users (~ 70 %) drink socially, do not use drugs or smoke.
attrib2=['education', 'job']
plotting_tool(okc, attrib2, 20, 10)
print(okc.education.value_counts())
There are a lot of options within the education column, with the majority of answers indicating users graduating from or attending college/university. As the variable is dominated by people attending/completing college, it will not be used within the model. It is of note that space camp seems to be strangely popular. There is a range of employment types, with no one industry dominating the answers, therefore the job column will be included in the model.
rel_status = ['status', 'orientation']
plotting_tool(okc, rel_status, 20, 6)
single_percent = round((okc.status == 'single').mean() * 100, 2)
print(f'The percentage of users that are single: {single_percent} %')
straight_percent = round((okc.orientation == 'straight').mean() * 100, 2)
print(f'The percentage of users that identify as straight: {straight_percent} %')
93% of users identify as single, which is unsurprising as OkCupid is a dating site. OkCupid also lets people identify as polyamorous or in open relationships, which accounts for the 7% that are not single. As the 'single' result dominates the answers, the relationship status variable will not be used in the model. A poll in the US on sexual orientation 1 found that 7.1% of people identified as LGBT. The OkCupid data shows 14% of people identify in this category. As this is twice the poll average it may be of benefit to the model so orientation will be included.
1. Gallup poll
pet_status = ['pets']
plotting_tool(okc, pet_status, 10, 6)
children_status=['offspring']
plotting_tool(okc, children_status, 10,6)
Most users appear to like both cats and dogs, with those liking dogs making up the second most popular answer. Most people do not have children (with qualifying information whether they want them in the future or not). Given the average age and the fact it is a dating site, it is unsurprising most people do not have children. The offspring data will not be included in the model as it is not easily simplified, thus making it hard for the model to learn anything from.
print('Cleaning the data to remove the qualifying information')
okc['religion_clean'] = okc.religion.str.split(' ').str[0]
#print(okc.religion_clean)
reli = ['religion', 'religion_clean']
plotting_tool(okc, reli, 20, 15)
As with the zodiac data, the religion column also contains qualifying information. The data has been cleaned into a new column, religion_clean, to remove the qualifying data. There seems to be a spread of religions, with no one answer dominating the data.
columns_for_model = ['body_type', 'diet_clean', 'job', 'pets', 'religion_clean', 'orientation', 'sex', 'sign_clean']
print(len(okc.diet_clean))
Number of null values in each column
nulls = (okc[columns_for_model].isnull().sum(axis = 0))
print(nulls)
nulls_values = nulls.tolist()
print(nulls_values)
Producing a dataframe of values
null_comb = list(zip(columns_for_model, nulls_values))
print(null_comb)
null_df = pd.DataFrame(null_comb, columns=['Column', 'NaN'])
null_df['% NaN values'] = round((null_df.NaN / (len(okc))) *100, 2)
display(null_df)
The null values within the data need to be addressed. As the model will be looking at zodiac signs, the NaN values in the sign_clean column are best removed. As there are not a large number of null values (9% and 14%, respectively) in the body_type and job columns, these null values will also be dropped.
columns_to_remove_nan = ['sign_clean', 'body_type', 'job']
for n in columns_to_remove_nan:
    print(n)
def remove_nulls(df, cols, df_columns, df2):
    # Drop nulls one column at a time and record the remaining null counts after each drop
    remaining_null = []
    for n in cols:
        df = df.dropna(subset=[n])
        drop_null = df[df_columns].isnull().sum(axis = 0)
        for value in drop_null:
            remaining_null.append(value)
        df2[f'NaN after {n} null drop'] = pd.Series(remaining_null)
        remaining_null = []
    return df, df2
okc_drop, null_df = (remove_nulls(okc, columns_to_remove_nan, columns_for_model, null_df))
print(len(okc_drop))
display(null_df)
If null values were removed from diet_clean, religion_clean and pets then a large chunk of the data would be removed (over 40%).
print(okc_drop.diet_clean.unique())
print(okc_drop.pets.unique())
print(okc_drop.religion_clean.unique())
Therefore, the null values are replaced with 'unknown' as perhaps certain zodiac signs have no opinion on pets, dietary or religious choices.
okc_drop =okc_drop.fillna('unknown')
print(okc_drop.diet_clean.unique())
print(okc_drop.pets.unique())
print(okc_drop.religion_clean.unique())
print(okc_drop[columns_for_model].isnull().sum(axis = 0))
okc_model_df = okc_drop[['body_type', 'diet_clean', 'job', 'pets', 'religion_clean', 'orientation', 'sex', 'sign_clean']]
The data is now clear of null values.
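The two-step treatment of missing values used above (drop rows for some columns, fill a placeholder for others) can be sketched on a toy frame. The rows below are hypothetical and only reuse three of the column names from this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; None marks an unanswered question
toy = pd.DataFrame({
    'sign_clean': ['leo', None, 'aries', 'virgo'],
    'job':        ['student', 'other', None, 'medicine'],
    'pets':       ['likes dogs', None, 'likes cats', None],
})

# 1) Drop rows where the target or a sparsely missing feature is absent
dropped = toy.dropna(subset=['sign_clean', 'job'])

# 2) Keep the remaining rows and flag gaps in the heavily missing feature instead
filled = dropped.fillna({'pets': 'unknown'})

print(filled)
```

Passing a dictionary to fillna limits the placeholder to the named columns, which may be preferable to filling every column of the frame at once.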
features = ['body_type', 'diet_clean', 'job', 'pets', 'religion_clean', 'orientation', 'sex']
okc_dummies = pd.get_dummies(data=okc_model_df, columns=features)
display(okc_dummies)
#print(okc_dummies.info())
okc_dummies['sign_labels'] = okc_dummies.sign_clean.astype('category').cat.codes
labels = okc_dummies[['sign_labels']]
labels_array = labels.squeeze().ravel()
print(labels_array)
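As a small illustration of the encoding step, the sketch below runs get_dummies and cat.codes on a hypothetical three-column frame, so the resulting column layout is easy to inspect.

```python
import pandas as pd

# Hypothetical rows reusing column names from this notebook
toy = pd.DataFrame({
    'sign_clean': ['leo', 'aries', 'leo', 'virgo'],
    'sex':        ['m', 'f', 'f', 'm'],
    'pets':       ['likes dogs', 'unknown', 'likes cats', 'likes dogs'],
})

# One indicator column per category of each listed feature
encoded = pd.get_dummies(data=toy, columns=['sex', 'pets'])

# Categories are sorted alphabetically before codes are assigned: aries=0, leo=1, virgo=2
encoded['sign_labels'] = encoded.sign_clean.astype('category').cat.codes

print(encoded.columns.tolist())
print(encoded.sign_labels.tolist())
```

The untouched sign_clean column stays first and the integer codes land last, which is why slicing with iloc[:, 1:-1] selects only the indicator columns.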
Import library to split the data for testing and training.
from sklearn.model_selection import train_test_split
The data will be split 80% for training the model and 20% for validating the model.
X_data = okc_dummies.iloc[:, 1:-1]
y_data = okc_dummies.iloc[:, 0:1]
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=100)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
Converting the y data into an array.
y_train = y_train.to_numpy().ravel()
y_test = y_test.to_numpy().ravel()
Models
Logistic Regression
Logistic regression is a machine learning algorithm that predicts the probability (ranging from 0 to 1) of a datapoint belonging to a specific category. These probabilities are used to classify/assign the observations to the more probable group. An example of this is using a logistic regression model to predict the probability that an incoming email is spam. If the probability is greater than 0.5, the email could automatically be sent to the spam folder. The email example is called binary classification as there are only two groups (i.e. spam or not spam).
As the zodiac data is not binary, a 'multinomial' argument can be passed to the model, so that it may classify more than 2 groups.
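In the multinomial case, the model produces one score per class and converts the scores into probabilities with the softmax function. The sketch below shows only that conversion step; the scores are hypothetical and this is not scikit-learn's internal code.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)      # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores for three classes (e.g. three signs) for one profile
scores = np.array([0.2, 1.5, -0.3])
probabilities = softmax(scores)
predicted_class = int(np.argmax(probabilities))   # class with the highest probability

print(probabilities.round(3))
print(f'Predicted class index: {predicted_class}')
```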
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
logres_model = LogisticRegression(max_iter=4000, multi_class='multinomial')
logres_model.fit(X_train, y_train)
logres_model.predict(X_test)
logres_test_score = logres_model.score(X_test, y_test)
lr_perc_success = round(logres_test_score * 100, 2)
labels_chart = logres_model.classes_
print(lr_perc_success)
The model was only 8.5% successful in predicting a zodiac sign.
A confusion matrix is a tool that allows us to visualise the performance of a classification machine learning model. The matrix compares the actual target values with those predicted by the model.
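To make the layout concrete, here is a minimal sketch that builds a confusion matrix by hand for a hypothetical 3-class problem, with rows as the true class and columns as the predicted class (the same orientation scikit-learn uses).

```python
import numpy as np

# Hypothetical true and predicted class indices for a 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2])

n_classes = 3
matrix = np.zeros((n_classes, n_classes), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    matrix[actual, predicted] += 1          # rows = true class, columns = predicted class

accuracy = np.trace(matrix) / matrix.sum()  # correct predictions lie on the diagonal

print(matrix)
print(f'Accuracy: {accuracy:.3f}')
```

A model that has learned something useful concentrates its counts on the diagonal; counts spread evenly across a row indicate guessing.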
conf_mat = confusion_matrix(y_test, logres_model.predict(X_test))
fig, ax = plt.subplots(figsize=(15,10))
sns.heatmap(conf_mat/np.sum(conf_mat), annot=True, fmt = '.1%', cmap = 'Spectral', xticklabels=labels_chart, yticklabels=labels_chart)
plt.xlabel('Predicted sign')
plt.ylabel('True sign')
plt.show()
The above confusion matrix is showing that the model is unsuccessful in predicting any zodiac signs, with incorrect classifications across the board.
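Logistic regression is a linear classifier, so it can only separate the signs if some linear combination of the one-hot encoded features distinguishes them. An accuracy this close to the 1-in-12 chance level (about 8.3%) suggests that the selected profile features carry little linear signal about a user's zodiac sign.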
K-Nearest Neighbour (KNN)
KNN is a classification algorithm built on the central idea that data points with similar attributes tend to fall into similar categories. The KNN algorithm utilises this 'feature similarity' to predict the values of unknown/new data points. Therefore, a new point (or point from the test set) is assigned a value based on how closely it resembles the points in the training set.
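The idea can be sketched in a few lines of NumPy. The helper knn_predict and the 2-D points below are hypothetical and use Euclidean distance with a simple majority vote; KNeighborsClassifier offers more options than this.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points in two well-separated groups
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))
```

The method only works when nearby points tend to share a label, which is the assumption being tested when it is applied to zodiac signs.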
from sklearn.neighbors import KNeighborsClassifier
The default number of neighbors within the algorithm is 5. As there are 12 zodiac signs, the model will initially be set to 12 nearest neighbors.
knn_model = KNeighborsClassifier(n_neighbors=12)
knn_model.fit(X_train, y_train)
predict_knn = knn_model.predict(X_test)
success_rate = knn_model.score(X_test, y_test)
print(round(success_rate * 100,2))
As with the logistic regression model, the success at predicting a zodiac sign is only between 8 and 9 %. This is, again, highly inaccurate, so the measures of algorithm effectiveness are investigated below.
print(accuracy_score(y_test, predict_knn))
print(recall_score(y_test, predict_knn, average='weighted'))
print(precision_score(y_test, predict_knn, average='weighted'))
print(f1_score(y_test, predict_knn, average='weighted'))
The accuracy, recall, precision and F1 scores are all between 8 and 9%, indicating that the KNN algorithm is highly ineffective in predicting a user's zodiac sign.
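One reason these numbers sit so close together is that support-weighted recall is mathematically identical to accuracy: each class's recall is weighted by its share of the true labels, so the weights cancel. The sketch below checks this by hand on hypothetical labels.

```python
import numpy as np

# Hypothetical labels for a 3-class problem
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 1, 2, 1, 0, 2, 2, 0, 1, 2])

classes = np.unique(y_true)
# Number of true examples per class
support = np.array([(y_true == c).sum() for c in classes])
# Per-class recall: correctly predicted examples of the class / true examples of the class
recall = np.array([((y_pred == c) & (y_true == c)).sum() / (y_true == c).sum()
                   for c in classes])

weighted_recall = np.sum(recall * support) / support.sum()
accuracy = (y_true == y_pred).mean()

print(f'Per-class recall: {recall.round(3)}')
print(f'Weighted recall:  {weighted_recall:.3f}')
print(f'Accuracy:         {accuracy:.3f}')
```

Weighted precision and F1 are not tied to accuracy in the same way, so they are the more informative of the extra metrics here.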
accuracy_values = []
for k in range(1,100):
    knn_model_neighbor = KNeighborsClassifier(n_neighbors=k)
    knn_model_neighbor.fit(X_train, y_train)
    accuracy_values.append(knn_model_neighbor.score(X_test, y_test))
print(accuracy_values)
import altair as alt
k_values = range(1,100)
knn_list = zip(k_values, accuracy_values)
knn_df = pd.DataFrame(knn_list, columns=['Number of neighbors', 'Model accuracy values'])
alt.Chart(knn_df).mark_line().add_selection(
alt.selection_interval(bind='scales', encodings=['x'])
).encode(
alt.X('Number of neighbors', type='quantitative', axis=alt.Axis(title='Number of neighbors', grid=False)),
alt.Y('Model accuracy values', type='quantitative', axis=alt.Axis(minExtent=30, title='Model accuracy score', grid=False), scale=alt.Scale(zero=False)),
tooltip=['Number of neighbors:Q', 'Model accuracy values:Q']
).properties(
width=700,
height=400
).configure_axis(
labelFontSize=16,
titleFontSize=20
).interactive()
print(max(accuracy_values))
The most successful prediction was 9.2%, with 9 nearest neighbors. This is still not a very successful model prediction, so another classification algorithm will be utilised.
from sklearn.ensemble import RandomForestClassifier
forest_model = RandomForestClassifier(n_estimators=20)
forest_model.fit(X_train, y_train)
fmp = forest_model.predict(X_test)
print(forest_model.score(X_test, y_test))
print(forest_model.feature_importances_)
accuracy_trees = []
for n in range(1,100):
    forest_model_trees = RandomForestClassifier(n_estimators=n)
    forest_model_trees.fit(X_train, y_train)
    accuracy_trees.append(forest_model_trees.score(X_test, y_test))
print(accuracy_trees)
n_values = range(1,100)
#plt.plot(n_values, accuracy_trees)
#plt.show()
tree_list = zip(n_values, accuracy_trees)
tree_list_df = pd.DataFrame(tree_list, columns=['Number of trees', 'Model accuracy values'])
alt.Chart(tree_list_df).mark_line().add_selection(
alt.selection_interval(bind='scales', encodings=['x'])
).encode(
alt.X('Number of trees', type='quantitative', axis=alt.Axis(title='Number of trees', grid=False)),
alt.Y('Model accuracy values', type='quantitative', axis=alt.Axis(minExtent=30, title='Model accuracy score', grid=False), scale=alt.Scale(zero=False)),
tooltip=['Number of trees:Q', 'Model accuracy values:Q']
).properties(
width=700,
height=400
).configure_axis(
labelFontSize=16,
titleFontSize=20
).interactive()
print(accuracy_score(y_test, fmp))
print(recall_score(y_test, fmp, average='weighted'))
print(precision_score(y_test, fmp, average='weighted'))
print(f1_score(y_test, fmp, average='weighted'))
The random forest model had no greater success than logistic regression or k-nearest neighbor in predicting the zodiac sign of an OkCupid user, with all models having a success rate of < 10 %.
NLP and Naive Bayes
Naive Bayes classifiers are supervised machine learning algorithms that leverage a probabilistic theorem to make predictions and classifications. They are widely used for sentiment analysis (determining whether a given block of language expresses negative or positive feelings) and spam filtering.
Here a Naive Bayes classifier will be used to analyse the essay questions within the dataset to see if NLP has better success in predicting a user's zodiac sign.
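The probabilistic idea behind a multinomial Naive Bayes text classifier can be sketched in plain Python: score each class by its log prior plus the log likelihood of each word, with add-one smoothing. The helpers train_nb and predict_nb, the mini-corpus and its labels are all hypothetical, and this is a simplified stand-in for MultinomialNB rather than its actual implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(docs, labels):
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab

def predict_nb(text, word_counts, class_counts, vocab):
    """Return the class with the highest log posterior, using add-one smoothing."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)      # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:                                    # ignore unseen words
                count = word_counts[label][word] + 1             # add-one (Laplace) smoothing
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical mini-corpus; the labels are arbitrary and for illustration only
docs = ['love hiking and camping', 'hiking trails every weekend',
        'love cooking and baking', 'baking bread every sunday']
labels = ['outdoors', 'outdoors', 'kitchen', 'kitchen']

model = train_nb(docs, labels)
print(predict_nb('weekend hiking', *model))
print(predict_nb('sunday baking', *model))
```

The classifier can only succeed if some words are noticeably more common in one class than another, which is the property the essay text would need to have for each sign.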
from sklearn.feature_extraction.text import CountVectorizer
import re
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')
Creating the dataframe and cleaning null values.
NB_df = okc[['sign_clean','self_summary', 'life_plans', 'good_at', 'notice_about', 'favourites',
'do_without', 'think_about', 'friday', 'private', 'message']].copy()
#remove null values
NB_df = NB_df.dropna(subset=['sign_clean','self_summary', 'life_plans', 'good_at', 'notice_about', 'favourites',
'do_without', 'think_about', 'friday', 'private', 'message'])
print(NB_df.isnull().sum())
print(len(NB_df))
Defining functions which utilise regex in order to clean the text, such as removing symbols and extra white spaces.
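def regex_function(text):
    # Replace HTML tags, newlines, URLs, HTML entity names and punctuation with a space
    return re.sub(r'<.*?>|\n+|http\S+|(?<=&)(.*?)(?=;)|,|\.|\:|;|-|/|&|!|\?|\(|\)|\+|@', ' ', text)
def remove_extra_whitespace(text):
    # Collapse any run of whitespace into a single space
    return re.sub(r'\s+', ' ', text)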
essay_list = ['self_summary', 'life_plans', 'good_at', 'notice_about', 'favourites', 'do_without', 'think_about', 'friday', 'private', 'message']
for essay in essay_list:
    NB_df[essay] = NB_df[essay].apply(lambda x: regex_function(x))
    NB_df[essay] = NB_df[essay].apply(lambda x: remove_extra_whitespace(x))
    NB_df[essay] = NB_df[essay].str.lower()
(NB_df.head())
Creating a dictionary of zodiac signs to map the dataframe to convert the signs into integers.
no_list = list(range(0,12))
print(no_list)
zodiac_list = list(okc_model_df.sign_clean.unique())
print(zodiac_list)
map_dict = dict(zip(zodiac_list, no_list))
print(map_dict)
#mapping the signs to integers
NB_df['sign_int'] = NB_df.sign_clean.map(map_dict)
(NB_df.head())
Creating a corpus column for use in model, in which all essay questions per row are joined into one string.
NB_df['corpus'] = NB_df[essay_list].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
Writing the dataframe to csv, in order to check corpus is correct
NB_df.to_csv('NB_df1.csv')
Preparing the data for use in Naive Bayes analysis, through definition of a function that will tokenize, stem and lemmatize the text, as well as removing stop words.
def NLP_processing(text):
    # Tokenize, stem, lemmatize, then drop English stop words and rejoin into one string
    tokenized = word_tokenize(text)
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(token) for token in tokenized]
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(token) for token in stemmed]
    stop_words = set(stopwords.words('english'))
    output = [x for x in lemmatized if x not in stop_words]
    output = ' '.join(output)
    return output
NB_df.corpus = NB_df.corpus.map(lambda x: NLP_processing(x))
The data will be split into training and test sets before vectorizing - to avoid training data leaking into the test data. The data will be split 75% for training, 25% for testing.
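The reason for fitting the vocabulary on the training split only can be sketched with a tiny bag-of-words implementation. The helpers fit_vocabulary and transform are hypothetical, simplified stand-ins for the fit and transform steps of CountVectorizer, and the texts are made up.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Build a word -> column index mapping from the given texts only."""
    words = sorted({word for text in texts for word in text.split()})
    return {word: index for index, word in enumerate(words)}

def transform(texts, vocabulary):
    """Turn each text into a count vector; words outside the vocabulary are dropped."""
    rows = []
    for text in texts:
        counts = Counter(text.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

# Hypothetical training and test texts
train_texts = ['coffee and books', 'books and music']
test_texts = ['music and surfing']        # 'surfing' never appears in training

vocabulary = fit_vocabulary(train_texts)  # fitted on the training split only
print(vocabulary)
print(transform(test_texts, vocabulary))
```

Because 'surfing' was never seen during fitting it contributes nothing to the test vector, so no information about the test split influences the features the model is trained on.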
corpus = NB_df.corpus
labels = NB_df.sign_int
NB_corpus_train, NB_corpus_test, NB_labels_train, NB_labels_test = train_test_split(corpus, labels, test_size=0.25, random_state=100)
print(NB_corpus_train.shape)
print(NB_labels_train.shape)
print(NB_corpus_test.shape)
print(NB_labels_test.shape)
Creating and fitting the counter.
counter = CountVectorizer()
counter.fit(NB_corpus_train)
NB_train_counts = counter.transform(NB_corpus_train)
NB_test_counts = counter.transform(NB_corpus_test)
Creating and fitting the classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(NB_train_counts, NB_labels_train)
print(classifier.score(NB_test_counts, NB_labels_test))
The model again has a success rate of ~ 9%, which is still disappointingly low. Below the confusion matrix will be utilised to see if the model was better at predicting some zodiac signs over others.
predictions = classifier.predict(NB_test_counts)
print(labels.value_counts(normalize=True))
matrix = confusion_matrix(NB_labels_test, predictions)
conf_mat2 = confusion_matrix(NB_labels_test, classifier.predict(NB_test_counts), normalize='true')
fig, ax = plt.subplots(figsize=(15,10))
# conf_mat2 is already row-normalised (normalize='true'), so it is plotted without dividing again
sns.heatmap(conf_mat2, annot=True, fmt = '.1%', cmap = 'Spectral', xticklabels=zodiac_list, yticklabels=zodiac_list)
plt.xlabel('Predicted sign')
plt.ylabel('True sign')
plt.show()
Again, the confusion matrix shows that the model is highly unsuccessful in predicting zodiac signs as it has struggled to make a distinction between any of the signs.
All the machine learning models above have failed to adequately predict zodiac signs. This will be discussed in the results and evaluation section below.
Results
The success of each machine learning algorithm in predicting an OkCupid user's zodiac sign was:
- Logistic Regression: 8.5 %
- K-Nearest Neighbor: 9.1 %
- Random Forest: 9.3 %
- Naive Bayes: 8.9 %
All models have a success rate lower than 10%, which makes them all highly unsuccessful. Reasons for this, and suggestions for improvements are discussed below.
Conclusion
The goal of this project was to accurately predict the zodiac signs of OkCupid users from information provided in their profile through implementation of supervised machine learning classification models. Four different algorithms were utilised (logistic regression, K-nearest neighbor, random forest and Naive Bayes classifier), with none achieving a success rate above 10%. As there are 12 zodiac signs, a random guess has a 1 in 12 (or about 8.3%) chance of being correct. This shows that a person guessing a zodiac sign has roughly the same odds of guessing correctly as the machine learning models.
The conclusion we can take from this is that the lifestyle choices and behaviours recorded in a profile appear to carry little or no information about a person's zodiac sign, meaning it was not possible to predict the zodiac sign from the data within a person's OkCupid profile with these models.
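The 1-in-12 baseline quoted above can be checked with a short simulation. The sketch assumes signs are uniformly distributed and guesses are purely random, which is a simplification of the real, slightly unbalanced data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_profiles, n_signs = 100_000, 12

# Hypothetical true signs and purely random guesses, both uniform over 12 classes
true_signs = rng.integers(0, n_signs, size=n_profiles)
guesses = rng.integers(0, n_signs, size=n_profiles)

chance_accuracy = (true_signs == guesses).mean()
print(f'Expected by chance: {1 / n_signs:.4f}')
print(f'Simulated accuracy: {chance_accuracy:.4f}')
```

Model accuracies in the 8 to 9 % range are therefore hard to distinguish from this baseline without a formal significance test.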
Next steps
This project could be further investigated by:
- Obtaining a larger dataset, as with more information the models may have greater success in learning the difference between the features of the 12 classes.
- As there are 12 zodiac signs, the number cannot be decreased. However, the question posed to the ML model could be simplified/made binary, i.e. is a user a Capricorn?
- Are there better ways to preprocess or visualise the data?
- Are there other aspects of the data a machine learning algorithm could investigate, such as predicting a user's sex, religion or drink/drug habits?
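If the binary reframing suggested above were tried, the class balance would change sharply, which affects how accuracy should be read. The sketch below uses hypothetical, perfectly balanced sign labels to show the accuracy of a trivial rule that never predicts Capricorn.

```python
import pandas as pd

# Hypothetical sign labels, two profiles per sign
signs = pd.Series(['aries', 'taurus', 'gemini', 'cancer', 'leo', 'virgo',
                   'libra', 'scorpio', 'sagittarius', 'capricorn', 'aquarius', 'pisces'] * 2)

is_capricorn = (signs == 'capricorn').astype(int)   # 1 = Capricorn, 0 = any other sign

# A rule that always answers "not Capricorn" is already right 11 times out of 12
always_no_accuracy = (is_capricorn == 0).mean()

print(is_capricorn.value_counts().to_dict())
print(f'Accuracy of always predicting 0: {always_no_accuracy:.3f}')
```

For such a one-vs-rest question, precision and recall on the positive class would likely be more informative than accuracy alone.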