Process Book for Project Data at Large





Overview & Motivation

The internet has become an integral part of modern life. As we take advantage of the conveniences that the digital age offers, we are becoming increasingly vulnerable to cybercrimes such as fraud and identity theft. Inspired by the recent Equifax data breach, which affected over 143 million people in the United States, and by a visualization of company data breach cases, we would like to investigate the increasingly serious problem of data breaches, use visualizations to raise public awareness of the need to monitor the safety of personal data, and offer approaches to cybersecurity protection.

The project consists of four main parts. We begin by describing the rise of the internet age and then transition to a brief background on hackers and why they hack. Next, we show selected incidents of company data breaches to make the point that the data we implicitly entrust to companies are susceptible to cyberattacks. These breaches often lead to cybercrimes against individuals, so we follow up by visualizing identity theft trends and the time needed to resolve identity theft cases. Lastly, we make suggestions for creating a safer online existence.



Goals & Questions

The main goal of this project is to use visualizations to raise public awareness of the need to monitor the safety of personal data, and to offer recommendations for cybersecurity protection. To do so, we would like to explore the following questions:

  1. Concerning the internet and the hackers --
    • How has internet usage in the U.S. evolved over time?
    • Who are hackers? Why do they hack?

  2. Concerning the data that we entrust to companies --
    • How secure is the data we entrust to companies while we use the internet?
    • Are incidents of company data breaches increasing over time?
    • Which type of breach are companies most susceptible to?
    • How sensitive are the data being breached (e.g., contents of emails, credit card information, or Social Security numbers)?
    • Are certain types of organizations more prone to certain types of data breaches?

  3. Concerning the consequences of data breaches --
    • What is the trend of identity theft in each state over time?
    • How long does it take to resolve identity theft issues?

  4. Concerning creating a safer online existence --
    • What are the steps we can take to improve the safety of our online existence?



Project Tasks

  1. Data acquisition and wrangling: Data were acquired from the sources listed in the Data Sources section. Most of the datasets are fairly clean, so we performed minimal data cleaning in this step. However, many of the datasets came as .xlsx files with multiple Excel sheets per file, which we converted into .csv or .json files. We also implemented a Python script ("WorldBankDataParse.py" in the data/internet-use folder) to parse data from the World Data Bank into .csv files.

  2. Visualization and interactivity implementation: We designed and implemented 10 visualizations; the details are given in the Design Evolution section.

  3. Acquisition/production and compilation of texts for storytelling: We read news articles about cybersecurity and collected the ones that we felt were helpful for telling a story and making the website more informative.

  4. Website design and implementation: The layout of the website went through multiple design iterations as we collected user feedback and reflected on earlier versions. We aimed to use web design to help users understand how to explore the visualizations and to support the storytelling.
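To illustrate the kind of parsing described in task 1, here is a minimal sketch of reshaping a wide, one-column-per-year table into long rows and writing them back out as CSV. It is not the project's "WorldBankDataParse.py"; the function names, the "Country Name" column, and the output columns (country, year, value) are assumptions made for illustration.

```python
import csv
import io


def wide_to_long(wide_csv_text, id_column="Country Name"):
    """Reshape a wide table (one column per year) into long rows.

    Assumes a header row whose year columns are all-digit labels such as
    "1995". Columns other than `id_column` and the year columns are ignored.
    """
    reader = csv.DictReader(io.StringIO(wide_csv_text))
    year_columns = [name for name in (reader.fieldnames or []) if name.strip().isdigit()]
    rows = []
    for record in reader:
        for year in year_columns:
            value = (record[year] or "").strip()
            if value == "":
                continue  # skip years with no reported value
            rows.append(
                {"country": record[id_column], "year": int(year), "value": float(value)}
            )
    return rows


def to_csv_text(rows):
    """Serialize the long rows as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["country", "year", "value"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder input, not real data: two made-up countries and two years.
sample = "Country Name,Country Code,1995,1996\nA,AAA,1.5,\nB,BBB,2,3\n"
long_rows = wide_to_long(sample)
csv_text = to_csv_text(long_rows)
```

A long layout like this is usually easier to bind to a line chart or a choropleth, since each row carries exactly one (country, year, value) observation.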



Data Sources

All of our data sources are listed on the main website. Here, we present them again as they relate to our project sections.



  1. Data Sources concerning the internet and the hackers --

  2. Data Sources concerning the data that we entrust to companies --

  3. Data Sources concerning the consequences of data breaches --

  4. Data Sources concerning creating a safer online existence --



Design Evolution

Here, we present the design evolution of each of our 10 visualizations, which were grouped into 6 composite visualizations. As you will see, most visualizations underwent drastic redesigns to achieve a balance between storytelling and visual effect.

Visualization 1 - Internet usage

The major point we wanted to make with this visualization is that the internet took over the world very quickly after its commercialization in 1995. In the initial design, we hoped to achieve this with a set of graphs that displayed the trend from different perspectives. We imagined presenting these graphs in a slideshow format.

Initial design

Final design

We felt that the initial design, while capable of conveying our point that internet usage had grown tremendously, was not very eye-catching because it relied on line and bar charts, which is a particular weakness for the very first visualization of the project. Therefore, we gathered additional data from the World Data Bank so that we could plot a choropleth. We also added interactivity by introducing animation and buttons to switch views. The linear growth in the line chart, coupled with the darkening map, tells a much more attention-grabbing story than the previous design.

Visualization 2 - About hackers

Initial design

This is one of the visualizations for which we changed not only the design but also the content entirely. Originally, the planned story flow was to visualize various online activities and compare their potential for exposing the user to cybersecurity threats. The design sketch is shown below.



After implementing the bar charts, we felt that a plain bar chart was not a very exciting visual transition. Fortunately, we came across a survey of hackers that we found interesting and informative. We thought that giving some insight into why and how hackers came to be would be a better transition to discussing cybersecurity issues, since we wouldn't be nearly as worried about cybersecurity if there weren't any hackers.

Second design

This design allows the user to hover over the reasons donut chart to read more about why hackers hack. It also allows the user to hover over the pie charts to see the results of 3 interesting survey questions, which reveal just how easy it is for hackers to break into a system and how most hackers feel sympathetic/empathetic toward other hackers who have been arrested.

One problem with this design is the coloring of the pie charts. Each color represents one category of response from the survey, but the color scheme overall does not reflect the message in the annotations. For instance, the annotation "88% of hackers can break into a system in less than 12 hours" reflects the sum of 3 answers (0-2 hours, 2-6 hours and 6-12 hours). Therefore, we decided to color the pie charts to match the message rather than simply color each response with a different shade. This resulted in the final design as shown below.

Final design

In the final design, we also animated the appearance of the charts and annotations to direct viewer attention to the relevant elements.
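The recoloring decision for the pie charts can be sketched as follows: every response that contributes to the annotated message gets one highlight color, and the rest get a muted color. This is only an illustration of the idea, not the project's code. The function name, the color values, the "more than 12 hours" label, and the counts are hypothetical placeholders rather than the survey's figures.

```python
def color_by_message(responses, highlighted,
                     highlight_color="#d62728", muted_color="#cccccc"):
    """Color responses by whether they support the annotated message.

    `responses` maps each answer label to its count; `highlighted` is the
    set of labels summed in the annotation. Returns the per-label colors
    and the highlighted share of all responses.
    """
    total = sum(responses.values())
    colors = {
        label: (highlight_color if label in highlighted else muted_color)
        for label in responses
    }
    highlighted_total = sum(
        count for label, count in responses.items() if label in highlighted
    )
    share = highlighted_total / total if total else 0.0
    return colors, share


# Placeholder counts for illustration only (not the survey's results).
responses = {
    "0-2 hours": 3,
    "2-6 hours": 3,
    "6-12 hours": 2,
    "more than 12 hours": 2,
}
under_12_hours = {"0-2 hours", "2-6 hours", "6-12 hours"}
colors, share = color_by_message(responses, under_12_hours)
```

With this mapping, the highlighted slices read as a single visual block whose size corresponds to the number quoted in the annotation.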

Visualization 3 - Data breach cases - Stacked Area/Bubble Chart

Initial design

The goal of this visualization is to show the overall trend of data breach cases and to allow more detailed cross-sectional views of the cases, either by clicking an area of the stacked area chart or by referring to the bubble chart on the right.

Final design

We preserved most of the decisions made in the initial design, with only a few tweaks. As we developed the visualization, we realized that in the bubble chart, the variables mapped to the y-axis and to bubble size actually have the same values in the data, so we effectively had only one dimension to visualize. We tackled this problem by changing the y-axis to the index of individual breach cases. In effect, we stack breach cases into pillars, with the size of each circle indicating the severity of the breach.
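The stacking idea can be sketched as below: each case receives a running index within its x-axis group, which becomes its y position, while its severity drives the circle size. This is a hedged illustration rather than the project's implementation. The field names ("year", "records"), the function name, and the square-root size scaling are assumptions.

```python
import math
from collections import defaultdict


def stack_cases(cases, x_key="year", size_key="records"):
    """Give each case a stack index within its x group and a circle radius.

    The stack index replaces the redundant y variable, so cases sharing an
    x value pile up into a pillar. The radius uses a square root so that
    circle area, not radius, grows in proportion to severity (an assumed
    design choice).
    """
    next_index = defaultdict(int)  # next free slot in each pillar
    placed = []
    for case in cases:
        column = case[x_key]
        placed.append(
            {
                **case,
                "stack_index": next_index[column],
                "radius": math.sqrt(case[size_key]),
            }
        )
        next_index[column] += 1
    return placed


# Placeholder cases for illustration only (not from the breach dataset).
cases = [
    {"year": 2015, "records": 4},
    {"year": 2015, "records": 9},
    {"year": 2016, "records": 16},
]
placed = stack_cases(cases)
```

Sorting the cases by severity before stacking would put the largest circles at the bottom of each pillar, if that ordering is preferred.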

The interactivity also worked as expected. When an area in the stacked area chart is clicked, the bubble chart filters accordingly to show only the selected category. Both charts respond to the "visualize by" drop-down selection to show different variables.

Visualization 4 - Data breach cases - Parallel Coordinates

Initial design

This chart uses the same data as Visualization 3, but it presents the data in a fundamentally different way. The goal of this visualization is to let viewers dive into the dataset and reach their own understanding. A parallel coordinates chart suits this purpose well. Our design allows the viewer to specify the axes, color by different axes, and brush on individual axes to filter observations.

Final design

The changes we made lie mainly in the animation. In our testing iterations, we received feedback that this graph was too cluttered and lacked a key message. Some testers also said that they were not sure how to use the visualization. Therefore, we decided to adopt a more guided approach. The grey panel in the middle first shows instructions (view 1). The viewer is then prompted to click. Upon clicking, the visualization appears in its first default view (view 2), which is meant to convey a specific message. The accompanying description and further instructions also appear in the grey panel. We decided to provide 2 default views (view 2 and view 3). Finally, after one more click, the visualization is revealed in its entirety (view 4) to allow maximum interactivity and user exploration.

View 1

View 2

View 3

View 4

Visualization 5 - Trends in identity theft

Initial design

Second design

One main drawback of a polar area chart is that the viewer would be visually drawn only to the 3-4 most visible slices. Therefore, the polar area chart might not be the most suitable type of graph, especially given that there would have been 50 slices to represent 50 states.

For this reason, we redesigned the visualization and used a modified version of a choropleth instead. In this design, we represent each state not by its actual geographical shape but by a square of uniform area. This removes the lie factor that is often associated with choropleths.

Clicking a state displays a line chart and a table to the right of the map, showing the trend over the years for the selected state.

Final design

Having implemented the second design, we realized that having a map, a line chart and a table on the same page is a bit overwhelming visually. The line chart and table also served the same purpose. Therefore, we decided to implement the line chart as a tooltip, appearing on mouseover, and remove the table altogether. This created a much cleaner view that allows the user to focus on the spatial distribution first, and then drill down to temporal distribution of specific states.

Visualization 6 - Time to resolve identity theft cases

Initial design

The main point of this visualization is to show the distribution of the time needed to resolve identity theft cases. To avoid ending the project with a rather conventional form of visualization, we wanted to capture audience interest by implementing an interactive, game-like visualization that contains some randomness.

Second design

Since the dart visualization had been implemented in previous projects, we felt that it was best to come up with an original design that could end the project on a high note. After consulting with our TF, Zona, we settled on a grid of squares that the user is prompted to click. The click triggers a series of animations that first reveal a color representing one category of time spent resolving identity theft, followed by the whole distribution. The first color is chosen randomly according to the distribution in previous victims' data.
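Choosing the first color according to a distribution amounts to weighted random sampling, which can be sketched as follows. This is a minimal illustration under assumed names, not the project's code. The category labels and weights are hypothetical placeholders and do not come from the victims' data.

```python
import random


def pick_category(distribution, rng=random):
    """Pick one category with probability proportional to its weight.

    `distribution` maps each category label to a non-negative weight, for
    example the number of past cases in that category. `rng` can be a
    seeded random.Random instance to make the draw repeatable.
    """
    categories = list(distribution)
    weights = [distribution[category] for category in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


# Placeholder weights for illustration only (not from the dataset).
distribution = {
    "category A": 5,
    "category B": 3,
    "category C": 2,
}
first_reveal = pick_category(distribution, random.Random(171))
```

Because the draw follows the same weights as the full distribution, the category a user sees first is more likely to be a common outcome than a rare one.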

The actual implementation is shown below.

Final design

View 1

View 2

View 3

View 4

Project Logistics

The workload of this project was split evenly across team members, and we were in constant communication to make sure everyone was on the same page and agreed on the designs. While everyone had a fair share of all aspects of the project (from web design to implementing visualizations), each member also had focus areas, which are shown below:



Member      Role
Michelle    Project organization (github repo, overall website design)
            Evaluation of project targeting
            Design master
Cindy       Drawing of sketches and organization of visualization/interaction ideas
            Implementation of website animations
            Face of the project (presentation and screencast voice-over)
Ziqi        Maintenance of the process book
            Preprocessing of the data
            Code master



We roughly followed the timeline below:



Closing Thoughts

We hope that you have not only enjoyed our interactive visualizations but also learned something interesting and practical about protecting your personal cyber space. It has been tremendously fun to apply what we've learned in CS171 and to explore beyond what was taught. Creating complex interactive visualizations and designing a website that is both user-friendly and visually appealing were by no means easy tasks, but it was a great learning journey, and we feel very satisfied with what we've achieved as a team.