Part 1: Exploratory Data Analysis and Visualisation

Dec 6, 2020

Data can tell stories. With numerous variables and data frames, it is crucial to visualize these stories to communicate relationships, trends, and challenges within the data. This article explains Exploratory Data Analysis and how we applied it to the career village dataset.

Introduction to Exploratory Data Analysis

Exploratory Data Analysis (EDA)

EDA is an approach for data investigation and summarization using tables, graphs, and diagrams. It is mostly used to explain the data using visual elements, making it easy to interpret and understand the raw data.

Career Village Data

We used the Career Village dataset from Kaggle. The dataset has 15 files, all of which are related to one another. With this many files, it is hard to see how they connect, where values are missing, and what trends they contain. In this article, we provide a tutorial on how to perform EDA to explain the raw data.

What better start for a Python tutorial than importing some libraries?

After downloading the data from Kaggle, we load it using Pandas.
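A minimal sketch of the import and loading step, assuming the Kaggle CSV files were extracted into a single folder. The helper name `load_csv_folder` and the folder name `data` are hypothetical and only serve the illustration.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(data_dir):
    """Read every CSV file in data_dir into a dict keyed by file name (without .csv)."""
    frames = {}
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        frames[csv_path.stem] = pd.read_csv(csv_path)
    return frames


# Example call; "data" is a hypothetical folder holding the Kaggle CSV files.
# frames = load_csv_folder("data")
# professionals = frames["professionals"]
```

Keeping the data frames in one dictionary makes it easy to loop over all files later, for example when checking column types or missing values.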

Question 1: How Do the Files Connect?

We present how the files connect using Graphviz.
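One way to produce such a diagram is to write the relationships as Graphviz DOT text. The sketch below builds the DOT source in plain Python. The edge list is an illustrative guess based on the description in this article, not a verified schema of the dataset, and the names `EDGES` and `build_dot` are hypothetical.

```python
# Each edge is (source file, target file, label). The list is an assumption
# derived from the article's description, not a verified schema.
EDGES = [
    ("students", "questions", "ask"),
    ("professionals", "answers", "write"),
    ("questions", "answers", "receive"),
    ("answers", "comments", "receive"),
    ("professionals", "emails", "receive"),
    ("emails", "matches", "contain"),
    ("matches", "questions", "point to"),
    ("questions", "question_scores", "scored by"),
    ("answers", "answer_scores", "scored by"),
    ("questions", "tag_questions", "tagged by"),
    ("tag_questions", "tags", "refer to"),
]


def build_dot(edges, name="career_village"):
    """Return DOT source text for a directed graph of file relationships."""
    lines = [f"digraph {name} {{", "  rankdir=LR;"]
    for source, target, label in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


dot_source = build_dot(EDGES)
print(dot_source)

# To render the text, pass it to Graphviz, for example with the Python package:
#   import graphviz
#   graphviz.Source(dot_source).render("file_connections", format="png")
```

Rendering requires the Graphviz binaries to be installed in addition to the Python package.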

Network Diagram Shows the Connections among Files

What Does the Figure Say?

If you open the Career Village website, you will find questions posted by students, who can have school and group memberships. Professionals sign up to answer these questions. Some of the professionals sign up for emails to get questions matching their area of expertise. Both questions and answers are scored. All users can have tags and can comment on the answers provided to questions.

Question 2: Do We Need to Perform Data Pre-processing?

To answer that question, we need to check whether each column's data type matches its content and whether there are any missing values.

In each of the data frames, we have columns that correspond to dates. Thus, we need to convert them to the pandas datetime type.

Afterward, we want these dates to reflect more information about the activity of the users, both students and professionals. Thus, we can utilize the dates of answers added, questions added, comments added, emails sent, professional registration, and student registration.
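The conversion can be sketched with `pd.to_datetime`. The toy rows below stand in for the real file, and the column name `professionals_date_joined` is assumed to match the Kaggle data; adjust it if your copy differs.

```python
import pandas as pd

# Toy rows standing in for professionals.csv (made up for illustration).
professionals = pd.DataFrame({
    "professionals_id": ["p1", "p2", "p3"],
    "professionals_date_joined": [
        "2016-03-01 10:15:00",
        "2018-07-22 08:00:00",
        None,
    ],
})


def to_datetime_columns(df, columns):
    """Return a copy of df with the given columns parsed as datetimes."""
    out = df.copy()
    for col in columns:
        # errors="coerce" turns unparseable values into NaT instead of raising
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


professionals = to_datetime_columns(professionals, ["professionals_date_joined"])
print(professionals.dtypes)
```

Once a column has a datetime type, parts such as the year are available through the `.dt` accessor, for example `professionals["professionals_date_joined"].dt.year`.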
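The article does not spell out which activity columns were derived, so the following is only one possible sketch: it attaches the date of each professional's most recent answer. The helper `add_last_activity`, the new column `last_answer_date`, and the toy rows are hypothetical; the other column names are assumed to follow the Kaggle files.

```python
import pandas as pd

# Toy data (made up for illustration).
professionals = pd.DataFrame({
    "professionals_id": ["p1", "p2", "p3"],
    "professionals_date_joined": pd.to_datetime(
        ["2016-03-01", "2017-05-10", "2018-01-20"]
    ),
})
answers = pd.DataFrame({
    "answers_author_id": ["p1", "p1", "p2"],
    "answers_date_added": pd.to_datetime(
        ["2016-04-01", "2018-06-15", "2017-05-12"]
    ),
})


def add_last_activity(users, events, user_key, event_user_key, event_date, new_col):
    """Attach each user's most recent event date as a new column."""
    last_dates = (
        events.groupby(event_user_key)[event_date]
        .max()                   # latest event per user
        .rename(new_col)         # name of the new column
        .rename_axis(user_key)   # align the key name with the users table
        .reset_index()
    )
    # A left merge keeps users without any event; their new column is NaT.
    return users.merge(last_dates, how="left", on=user_key)


professionals = add_last_activity(
    professionals,
    answers,
    user_key="professionals_id",
    event_user_key="answers_author_id",
    event_date="answers_date_added",
    new_col="last_answer_date",
)
print(professionals)
```

Users who never performed the activity end up with a missing value in the new column, which is why a missing-data check is useful after this step.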

After pre-processing the dates to reflect the users' activity, we need to check for missing data before performing any further analysis with the new columns. To do so, we will use the following code to visually illustrate the missing data.
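A minimal sketch of such a check, assuming a bar chart of the percentage of missing values per column is an acceptable way to show it. The helper names and the toy data frame are hypothetical.

```python
import pandas as pd


def missing_percentage(df):
    """Percentage of missing values per column, sorted from high to low."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def plot_missing(df, title):
    """Draw a horizontal bar chart of the missing percentage per column."""
    # Imported here so that missing_percentage works without matplotlib.
    import matplotlib.pyplot as plt

    ax = missing_percentage(df).plot.barh(figsize=(8, 4))
    ax.set_xlabel("Missing values (%)")
    ax.set_title(title)
    plt.tight_layout()
    plt.show()


# Toy rows (made up for illustration).
students = pd.DataFrame({
    "students_id": ["s1", "s2", "s3", "s4"],
    "students_location": ["New York, New York", None, "Boston, Massachusetts", "Chicago, Illinois"],
    "last_comment_date": [None, None, None, pd.Timestamp("2018-01-01")],
})
print(missing_percentage(students))
# plot_missing(students, "Missing data: students")
```

The same two helpers can be applied to the professionals data frame to get a comparable figure.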

The code produces these figures.

What Do These Figures Say?

The professionals' data shows that location, industry, tags, and headlines have the highest percentage of existing data, whereas groups, comments, answers, and schools have the lowest. This is a plausible representation of the data since there are more questions than answers. However, the missing data shows that most professionals are not registered for groups and do not comment on answers.

The students' data shows that location has the lowest missing-data rate, similar to the professionals' dataset. It also shows that students tend to have fewer tags, comments, and groups.

Since location is the most available variable in the datasets, let's check whether it shows any trends worth noting.

Question 3: Are There Any Trends in the Location Variables?

We plot the top ten locations for both students and professionals using the following code.
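A minimal sketch of this step, assuming the location columns are named `students_location` and `professionals_location` as in the Kaggle files. The helper names and the toy rows are hypothetical.

```python
import pandas as pd


def top_locations(df, location_col, n=10):
    """Return the n most common locations with their counts (missing values are ignored)."""
    return df[location_col].value_counts().head(n)


def plot_top_locations(students, professionals, n=10):
    """Plot the top n locations of students and professionals side by side."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    panels = [
        (students, "students_location", "Students"),
        (professionals, "professionals_location", "Professionals"),
    ]
    for ax, (df, col, label) in zip(axes, panels):
        # sort_values() puts the largest bar at the top of a horizontal chart
        top_locations(df, col, n).sort_values().plot.barh(ax=ax)
        ax.set_title(f"Top {n} locations: {label}")
        ax.set_xlabel("Number of users")
    plt.tight_layout()
    plt.show()


# Toy rows (made up for illustration).
students = pd.DataFrame({
    "students_location": [
        "New York, New York", "Boston, Massachusetts", "New York, New York",
        None, "New York, New York", "Boston, Massachusetts", "Chicago, Illinois",
    ]
})
print(top_locations(students, "students_location", n=2))
```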

The code produces the following plot.

For both students and professionals, the top location is New York, New York. All of the top ten locations for professionals are in the United States. For students, however, the second and ninth places are located in India. This highlights that the users are based mainly in the United States and India, and it suggests that the user base is growing. To investigate this, we plot the growth of users over time.

Question 4: Are There Any Trends in User Growth over the Years?

To answer this question, we need to get each user's unique ID and the date they joined the website. Then, we extract the year of registration for each user. Lastly, we count the number of users registered in each year. These steps are performed for both students and professionals.
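The steps above can be sketched as follows, assuming column names such as `students_id` and `students_date_joined` from the Kaggle files. The helper names and the toy rows are hypothetical.

```python
import pandas as pd


def registrations_per_year(df, id_col, date_col):
    """Count unique users by the year in which they joined."""
    users = (
        df[[id_col, date_col]]
        .drop_duplicates(subset=id_col)  # one row per user
        .dropna(subset=[date_col])       # skip users without a join date
    )
    years = pd.to_datetime(users[date_col]).dt.year
    return years.value_counts().sort_index()


def plot_growth(per_year, title):
    """Draw a line graph of the number of users registered per year."""
    import matplotlib.pyplot as plt

    ax = per_year.plot(marker="o", figsize=(8, 4))
    ax.set_xlabel("Year")
    ax.set_ylabel("Registered users")
    ax.set_title(title)
    plt.tight_layout()
    plt.show()


# Toy rows (made up for illustration); "s3" appears twice on purpose.
students = pd.DataFrame({
    "students_id": ["s1", "s2", "s3", "s3", "s4"],
    "students_date_joined": [
        "2015-02-01", "2016-05-03", "2016-09-09", "2016-09-09", "2017-01-15",
    ],
})
per_year = registrations_per_year(students, "students_id", "students_date_joined")
print(per_year)
# plot_growth(per_year, "Student registrations per year")
```

If the total number of users over time is of interest rather than the yearly count, `per_year.cumsum()` gives the running total.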

The output of the code is a line graph that reflects the number of users registered for a specific year.

The line graphs show that an increasing number of users are joining the platform, whether they are students or professionals. The rate of growth was highest between 2015 and 2017, as reflected in the steep slopes of both figures. However, the number of students is higher than the number of professionals, which suggests a gap between the demand created by students' questions and the number of professionals signed up to answer them. Another trend to investigate is the growing interest in specific topics or words in the tags of questions, answers, and professionals.

Question 5: What are the Trends in the Tags Used?

To visualize these tags, we use the wordcloud library. We use the tags of students, questions, and professionals. We plot the top 20 most frequent words, using the following code.
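A minimal sketch of this step, assuming the tags are available as a list or column of strings. Counting the words with `collections.Counter` and passing the result to `WordCloud.generate_from_frequencies` is one way to limit the cloud to the top 20 words; the helper names and the toy tags are hypothetical.

```python
import re
from collections import Counter


def top_words(texts, n=20):
    """Return the n most frequent lowercase words as a {word: count} dict."""
    counter = Counter()
    for text in texts:
        if isinstance(text, str):  # skip missing values such as None or NaN
            counter.update(re.findall(r"[a-z][a-z\-]*", text.lower()))
    return dict(counter.most_common(n))


def plot_word_cloud(frequencies, title):
    """Render a word cloud in which font size follows word frequency."""
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud  # third-party package: pip install wordcloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()


# Toy tags (made up for illustration).
question_tags = ["college", "career", "College", "engineering", "career", "college", None]
frequencies = top_words(question_tags, n=20)
print(frequencies)
# plot_word_cloud(frequencies, "Question tags")
```

Running the same two functions on the student tags and on the professional tags gives one cloud per group, which makes the three vocabularies easy to compare.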

The word cloud is a visually appealing tool that represents the frequency of words through font size. We can see that the words college and career are the most frequent in question and student tags, whereas telecommunications is the most frequent word used by the professionals. Aside from their visual appeal, these figures reflect the demographics of the students. For instance, the presence of words like engineering and medicine reflects the increased demand for science-related questions within the student base, while the professionals' tags show a more diverse set of subjects of interest. In addition, the difference between the tags used by the students and those used by the professionals helps explain the gap between questions and answers.