Data analytics

One of my “secondary jobs” was that of data analyst – “secondary” meaning that I always provided metrics while doing something else (it’s quite common for an officer to be involved in more than just one “job position”), and this is good, because probably the greatest value a data analyst provides is when they are also an expert in the field the data refer to. But I’m not here to write about I, T, M (or whatever letter) shapes (it’s about skills, you can read more on our friend Wikipedia). This post is intended to give you a (not-so-)little overview of the data analysis process. If you’re like me, you probably started tracking some metrics (like time, money and everything else) and making some calculations early on, but obviously at a very early age (let’s say 6-8 y.o.) you don’t master anything beyond sums and maybe the arithmetic mean; moreover, you usually don’t deal with complex data before you start serious projects or begin to work with (serious) analytics. This is where a consistent, clear and reproducible method is not only desirable, but absolutely necessary. What I’ll try to summarize here, adding here and there some notes from my long experience with data, is a course I completed on my best friend Coursera (if you read my blog, you already know I completed hundreds of courses; I really suggest you learn something, maybe through MOOCs, during your short life). The course I’m talking about is the “Google Data Analytics Professional Certificate” (a professional certificate, on Coursera, means it’s provided by a company and consists of multiple individual courses).

A very few words before starting

Before presenting the concepts and procedures, just in case, I repeat a well-known disclaimer: the content here is for educational purposes only, under fair use; plus, I suggest you study from the source, for several reasons I already explained when writing about summaries.

A very quick point to understand the context and avoid confusion: data analysis is a part of data analytics (which can also include data science, creating new ways of modeling and understanding the unknown by using raw data). So, to clarify, I’m writing about: the collection, transformation and organization of data to draw conclusions, make predictions and drive informed decision-making. Generally speaking, there’s a correlation between the required technical skills and the value of the data, in this order from low to high: descriptive (what happened?), diagnostic (why did it happen?), predictive (what will happen?), prescriptive (how can we make it happen?).

Just a quick reminder that may not be so obvious to everyone:

  • while this “professional certificate” is (in my opinion) really good for a beginner without prior experience (even if a little real-world experience with data can be useful to better “see” its usefulness and purpose), every similar course could be just as good, as long as it’s well done and covers all the basics, with a good balance between theory and practice: you don’t want to be just a theorist in the statistical field, nor do you want to be only a monkey pressing keys on your keyboard without understanding what lies beneath – this is, in my personal (and strong) opinion, one of the major risks in the near future: having the average worker mindlessly clicking on something without really understanding input, process and output;
  • don’t focus too much on specific tools: not only do they change quite often (unlike the basic concepts behind them), but there are so many that you wouldn’t even have time to look at them all – just to make an example: focus on learning to properly use a spreadsheet, whatever it is, since you can “port” your knowledge of formulas from one to another with just minor changes (well, not always, especially if you’re a really advanced user relying on the almost unknown power tools in Excel, but this is not the case); the same applies to programming languages (even if everyone knows that Python is better than R :P);
  • as everything in life: don’t be obsessed with results, love the process, enjoy the journey 😉
Image created by me with Stable Diffusion

You will find here the processes used by a data analyst, but remember that there are also other professionals involved with data, like Data Engineers (who develop and maintain databases and related systems, are responsible for reliability, and transform data so it can be used correctly) and Data Warehouse Specialists (who secure data, ensure availability, and run backups to prevent loss). This usually holds for big companies with a solid department working with data; in very small companies it can happen that a data analyst is responsible for everything, from choosing and installing software to making periodic backups. Here, though, we will focus on the core activities of a data analyst.

That said, it’s now time to start!

The overall process at a glance

Data analytics process, scheme from the Google course

Every journey begins with a step, but before you start you need to have a clear destination in mind (don’t worry, sometimes it can be just a direction; you will adjust as you walk). It all starts with questions: just as in life, the quality of your results depends largely on the quality of your questions. That’s why “ask” is probably the part I love most. After all, it reminds me that project and time management are useless without purpose. For a similar reason, I also like the preparation phase, keeping in mind the famous quote “If I had five minutes to chop down a tree, I’d spend the first three sharpening my axe”. Maybe one of the reasons I place emphasis on these 2 phases is that I still remember the breath of some incompetent senior officers on my neck, yelling: “Move, I don’t see you doing anything with the data, do your engineering thing!” (I had a hard time explaining that thinking is often the most important thing before you start handling data, but I wasn’t surprised, since those who started moving in random directions were rewarded, even if they often caused double the effort and lost time to return to the starting point).

Before going deeper, I would like to emphasize that knowing how to analyze data is something that is needed in every area of life, not just work – even if you probably won’t apply all 9 stages to every detail of your life:

Big Data Analytics Lifecycle – from “Big Data Fundamentals: Concepts, Drivers & Techniques”

One last important thing: keep in mind the data analyst mindset, with these valuable skills:

  • curiosity;
  • understanding context (group things into categories);
  • having a technical mindset (break things down into smaller steps and work with them in a logical way);
  • data design (organize data);
  • data strategy (mgmt of people, process and tools).

Analytical thinking: identifying and defining a problem and then solving it by using data in an organized, step-by-step manner:

  • visualization
  • strategy (stay focused on what you want to achieve)
  • problem-orientation (ask right questions)
  • correlation (!= causation)
  • big-picture and detail-oriented thinking.

With time and practice, you’ll acquire structured thinking: recognize the problem/situation, organize the available info, reveal gaps and opportunities, identify the options. It will help you see things differently and find solutions.

Below, the phases in detail.

1. Ask

It’s impossible to solve a problem if you don’t know what it is. These are some things to consider:

  • Define the problem you’re trying to solve
  • Make sure you fully understand the stakeholder’s expectations
  • Focus on the actual problem and avoid any distractions
  • Collaborate with stakeholders and keep an open line of communication
  • Take a step back and see the whole situation in context

There can be different types of problems, including:

  • Make predictions (e.g.: wanting to know the best advertising method to bring in new customers, based on location, type of media and so on);
  • Categorize things (like clustering, e.g.: classify customer service calls based on keywords or score);
  • Spot something unusual (anomaly detection, e.g.: set alerts when certain data doesn’t trend normally – see the sketch after this list);
  • Identify themes (like unsupervised clustering to discover broader concepts/trends);
  • Discover connections (e.g.: find similar situations, common in UX, common also in logistics);
  • Find patterns (e.g.: time series).
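To make the “spot something unusual” case a bit more concrete, here’s a minimal sketch (plain Python, invented numbers – not a technique prescribed by the course) that flags values far from the mean; a real alerting system would use something more robust:

```python
from statistics import mean, stdev

# Hypothetical daily order counts; the last value is the "unusual" one.
daily_orders = [102, 98, 105, 110, 97, 101, 99, 104, 96, 103, 250]

avg = mean(daily_orders)
sd = stdev(daily_orders)

# Simple rule of thumb: flag anything more than 2 standard deviations from the mean.
anomalies = [x for x in daily_orders if abs(x - avg) > 2 * sd]
print(f"mean={avg:.1f}, stdev={sd:.1f}, anomalies={anomalies}")
```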

When it’s possible, try to formulate SMART questions (please, don’t make me write about the SMART acronym again – I already wrote about it). Avoid “leading questions” (like when you want your team to confirm what you already have in mind), closed-ended questions and vague questions.

Before moving on to the other phases, let’s look at the “Data Lifecycle” through the most used tool in data analytics: spreadsheets (but the same applies to everything: as I wrote at the beginning, be a master of the concepts before looking at the tools).
Spreadsheets in the data analytics lifecycle (a small code sketch of the “analyze” step follows the list):

  • Plan for the users who will work within a spreadsheet by developing organizational standards. This can mean formatting your cells, the headings you choose to highlight, the color scheme, and the way you order your data points. When you take the time to set these standards, you will improve communication, ensure consistency, and help people be more efficient with their time.
  • Capture data by the source by connecting spreadsheets to other data sources, such as an online survey application or a database. This data will automatically be updated in the spreadsheet. That way, the information is always as current and accurate as possible.
  • Manage different kinds of data with a spreadsheet. This can involve storing, organizing, filtering, and updating information. Spreadsheets also let you decide who can access the data, how the information is shared, and how to keep your data safe and secure. 
  • Analyze data in a spreadsheet to help make better decisions. Some of the most common spreadsheet analysis tools include formulas to aggregate data or create reports, and pivot tables for clear, easy-to-understand visuals. 
  • Archive any spreadsheet that you don’t use often, but might need to reference later with built-in tools. This is especially useful if you want to store historical data before it gets updated. 
  • Destroy your spreadsheet when you are certain that you will never need it again, if you have better backup copies, or for legal or security reasons. Keep in mind, lots of businesses are required to follow certain rules or have measures in place to make sure data is destroyed properly.
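As promised above, here’s a minimal pandas sketch of the “analyze” step – the column names and numbers are invented, and the same result could of course be obtained with a spreadsheet pivot table:

```python
import pandas as pd

# Hypothetical sales records, as they might arrive from a survey tool or an export.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "amount":  [120.0, 80.0, 95.0, 130.0, 60.0],
})

# The code equivalent of a pivot table: total and average amount per region/product.
report = sales.pivot_table(index="region", columns="product",
                           values="amount", aggfunc=["sum", "mean"])
print(report)
```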

At the beginning, and every time you feel lost, remember the context: context can turn raw data into meaningful information. It is very important for data analysts to contextualize their data. This means giving the data perspective by defining it. To do this, you need to identify:

  • Who: The person or organization that created, collected, and/or funded the data collection;
  • What: The things in the world that data could have an impact on;
  • Where: The origin of the data;
  • When: The time when the data was created or collected;
  • Why: The motivation behind the creation or collection;
  • How: The method used to create or collect it.

Focus on:

  1. Who the stakeholder is (project manager, boss, others);
  2. Who manages the data;
  3. Where you can go if you need help.

2. Prepare

You will decide what data you need to collect in order to answer your questions and how to organize it so that it is useful. You might use your business task to decide: 

  • What metrics to measure
  • Locate data in your database
  • Create security measures to protect that data

How can we collect data?

  • Directly
  • Second-party (an experienced organization that collects it for you)
  • Third-party (an organization that doesn’t collect the data directly; check for bias and consistency before using it)

When the population (all possible data values in a certain dataset) is too big or hard to reach, you may consider working with just a sample.

We can have different data formats:

  • Qualitative: categorical/text
  • Quantitative: numbers
  • Discrete/continuous (e.g.: currency vs time)
  • Nominal/ordinal (e.g.: “yes/no/not sure” vs 1-2-3-4-5 – see the sketch after this list)
  • Internal/External
  • Structured/Unstructured (tables/spreadsheets/DBs in rows and columns vs audio/video/social media)
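As mentioned in the list, here’s a minimal pandas sketch of the nominal vs ordinal distinction (the survey answers are invented): an ordered categorical carries ranking information, a plain one doesn’t.

```python
import pandas as pd

# Nominal: categories without an inherent order.
answers = pd.Categorical(["yes", "no", "not sure", "yes"],
                         categories=["yes", "no", "not sure"])

# Ordinal: a 1-5 satisfaction scale, where the order matters.
scores = pd.Categorical([3, 5, 1, 4],
                        categories=[1, 2, 3, 4, 5], ordered=True)

print(answers.ordered, scores.ordered)   # False True
print(scores.min(), scores.max())        # 1 5 (meaningful only because it's ordered)
```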

And different data models (more here and here):

  • Conceptual –> business models
  • Logical –> data entities
  • Physical –> tables

In the real world, data usually comes in many different formats and some data transformation is often necessary (a short code sketch follows the lists below), including:

  • Adding, copying, or replicating data 
  • Deleting fields or records 
  • Standardizing the names of variables
  • Renaming, moving, or combining columns in a database
  • Joining one set of data with another
  • Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (CSV) file.

This is done for different purposes, like:

  • Data organization: better organized data is easier to use
  • Data compatibility: different applications or systems can then use the same data
  • Data migration: data with matching formats can be moved from one system to another
  • Data merging: data with the same organization can be merged together
  • Data enhancement: data can be displayed with more detailed fields 
  • Data comparison: apples-to-apples comparisons of the data can then be made
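Here’s the short sketch mentioned above – a minimal pandas version of some of those transformations (standardizing variable names, joining one set of data with another, saving in a different format); the tables and column names are invented:

```python
import pandas as pd

# Two hypothetical extracts with inconsistent column names.
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "Full Name": ["Ada", "Grace", "Alan"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "total": [25.0, 40.0, 15.0]})

# Standardize the names of variables, then join one set of data with another.
customers = customers.rename(columns={"cust_id": "customer_id",
                                      "Full Name": "full_name"})
merged = customers.merge(orders, on="customer_id", how="left")

# Save in a different format (here, CSV) for compatibility/migration.
merged.to_csv("customers_with_orders.csv", index=False)
print(merged)
```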

Another big difference is in the way data is stored and displayed (a small sketch follows the comparison):

Wide data is preferred when:
  • creating tables and charts with a few variables about each subject;
  • comparing straightforward line graphs.

Long data is preferred when:
  • storing a lot of variables about each subject (for example, 60 years’ worth of interest rates for each bank);
  • performing advanced statistical analysis or graphing.
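And here’s the sketch: a minimal pandas example of the same distinction (the interest-rate figures are invented) – melt turns wide data into long data, pivot goes back:

```python
import pandas as pd

# Wide: one row per bank, one column per year.
wide = pd.DataFrame({"bank": ["Bank A", "Bank B"],
                     "2021": [0.5, 0.7],
                     "2022": [1.5, 1.8]})

# Long: one row per bank/year combination -- handier for statistical analysis.
long_df = wide.melt(id_vars="bank", var_name="year", value_name="interest_rate")

# ...and back to wide again.
wide_again = long_df.pivot(index="bank", columns="year", values="interest_rate")
print(long_df)
print(wide_again)
```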

This was about the way to collect/store/exchange data, but what about the content?

Biases

  • Sampling bias: when the sample is not representative of the population as a whole
  • Observer (/experimenter/researcher) bias: different specialists can measure/see the same data differently (e.g.: two nurses taking the same measurement on a patient)
  • Interpretation bias: the tendency to interpret ambiguous situations in a positive or negative way
  • Confirmation bias: the tendency to confirm expected results based on previous experience/beliefs

Data credibility

  • Reliable
  • Original
  • Comprehensive
  • Current
  • Cited (from reliable org?)

Data ethics & privacy

Well-founded standards of “right and wrong” about how data is collected, used and shared:

  • Ownership
  • Transparency
  • Consent (the individual’s right to decide how and for how long their data will be used/stored)
  • Currency
  • Privacy (preserve user identity)
  • Openness (free data access)

On the last point, open data, there are different sources we can use; here are the most famous:

  1. U.S. government data site: Data.gov is one of the most comprehensive data sources in the US. This resource gives users the data and tools that they need to do research, and even helps them develop web and mobile applications and design data visualizations. 
  2. U.S. Census Bureau: This open data source offers demographic information from federal, state, and local governments, and commercial entities in the U.S. too. 
  3. Open Data Network: This data source has a really powerful search engine and advanced filters. Here, you can find data on topics like finance, public safety, infrastructure, and housing and development.
  4. Google Cloud Public Datasets: There are a selection of public datasets available through the Google Cloud Public Dataset Program that you can find already loaded into BigQuery.  
  5. Dataset Search: The Dataset Search is a search engine designed specifically for data sets; you can use this to search for specific data sets.

Organize data

  • Naming convention
  • Foldering
  • Archiving older files
  • Align naming and storage practices in the team
  • Develop metadata practices

3. Process

Clean data is the best data, and you will need to clean up your data to get rid of any possible errors, inaccuracies, or inconsistencies (a small code sketch follows the list). This might mean:

  • Using spreadsheet functions to find incorrectly entered data
  • Using SQL functions to check for extra spaces
  • Removing repeated entries
  • Checking as much as possible for bias in the data
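Here’s the small sketch mentioned above – a minimal pandas version of those checks (a spreadsheet or SQL would do just as well; the data is invented): trim extra spaces, remove repeated entries, flag values that were clearly entered incorrectly.

```python
import pandas as pd

raw = pd.DataFrame({
    "name": [" Alice", "Bob ", "Bob ", "Carol"],
    "age":  [34, 29, 29, 430],   # 430 is almost certainly a data-entry error
})

cleaned = raw.copy()
cleaned["name"] = cleaned["name"].str.strip()   # extra spaces (like TRIM in SQL/spreadsheets)
cleaned = cleaned.drop_duplicates()             # repeated entries
suspicious = cleaned[(cleaned["age"] < 0) | (cleaned["age"] > 120)]   # incorrectly entered data

print(cleaned)
print("Rows to review:")
print(suspicious)
```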

The first thing to do with data comes down to a clear concept: ensure integrity (accuracy, completeness, consistency). Do you think it’s trivial? Well, then you probably never worked with data in a real environment. There are a lot of issues you can encounter, like data replication (e.g.: copies not correctly in sync), data transfer (e.g.: incomplete transfers), data manipulation done wrong, human error, hacking, failures (e.g.: hardware failures, or a fire in a data center without backups) and many others.

When you are getting ready for data analysis, you might realize you don’t have the data you need or you don’t have enough of it. In some cases, you can use what is known as proxy data in place of the real data. Think of it like substituting oil for butter in a recipe when you don’t have butter. In other cases, there is no reasonable substitute and your only option is to collect more data. Consider the following data issues and suggestions on how to work around them.

Data issue 1: no data

Possible Solutions:

  • Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data.
  • If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees.
  • If there isn’t time to collect data, perform the analysis using proxy data from other datasets.  This is the most common workaround.
  • If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic.

Data issue 2: too little data

Possible Solutions:

  • Do the analysis using proxy data along with actual data.
  • If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors. 
  • Adjust your analysis to align with the data you already have.
  • If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only.

Data issue 3: wrong data, including data with errors*

Possible Solutions

  • If you have the wrong data because requirements were misunderstood, communicate the requirements again.
  • If you need the data for female voters and received the data for male voters, restate your needs.
  • Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors.
  • If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values.
  • If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias.
  • If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with bad translation and go ahead with the analysis of the other data.

* Important note: sometimes data with errors can be a warning sign that the data isn’t reliable. Use your best judgment!

Sample size

Population: The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.

Sample: A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.

Margin of error: Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. 

Confidence level: How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study. 

Confidence interval: The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.

Statistical significance: The determination of whether your result could be due to random chance or not. The greater the significance, the less due to chance.

Hypothesis testing

Getting a meaningful result from a test.
Suggested by Google: keep the Type II error rate (β) at .2 at most, which corresponds to a statistical power of at least .8 (80%).

For the significance level, common values are 0.05 or even 0.01; hard sciences (e.g.: physics) go as low as 5 sigma (about 3 * 10^-7 = 0.0000003).

  • Type I Error: the incorrect rejection of a true null hypothesis, or a false positive (a small simulation of this follows below);
  • Type II Error: the incorrect failure to reject (so, “the acceptance” of) a false null hypothesis, or a false negative.
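Here’s the small simulation mentioned above (plain Python, synthetic data): both groups are drawn from the same distribution, so the null hypothesis is true by construction, and yet roughly 5% of the runs still come out “significant” at the 0.05 level – those are Type I errors.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(42)
ALPHA = 0.05
RUNS = 2000
false_positives = 0

for _ in range(RUNS):
    # Two samples from the *same* distribution: any "significant" difference is a false positive.
    a = [random.gauss(100, 15) for _ in range(50)]
    b = [random.gauss(100, 15) for _ in range(50)]
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided z-test
    if p_value < ALPHA:
        false_positives += 1   # Type I error: rejecting a true null hypothesis

print(f"False positive rate: {false_positives / RUNS:.3f} (expected about {ALPHA})")
```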

When data is partial or not available: use proxy data.

  • A new auto dealership can’t wait until the end of the month and wants sales projections now: the analyst uses the number of clicks on the car specifications on the dealership’s website as a proxy for potential sales at the dealership.
  • A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years: the analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years.
  • A city wants to know how a tourism campaign is going to impact travel, but the results aren’t available yet: the analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier.

Determine best sample size

There are already some “calculators” (online spreadsheets) to determine good numbers for research. For example, for a population of 500, a 95% confidence level and a 5% margin of error, we need more than 200 people (if we choose fewer than 200 people with these parameters, our study won’t be good enough) – a small sketch after the definitions below reproduces the calculation.

  • Confidence level: The probability that your sample size accurately reflects the greater population.
  • Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population.
  • Population: This is the total number you hope to pull your sample from.
  • Sample: A part of a population that is representative of the population.
  • Estimated response rate: If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.
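As a sanity check on the numbers quoted above (population of 500, 95% confidence, 5% margin of error), here’s a minimal sketch of the standard sample-size formula for a proportion, with the finite-population correction – the online calculators use essentially the same math:

```python
from math import ceil
from statistics import NormalDist

def sample_size(population, confidence=0.95, margin_of_error=0.05, p=0.5):
    """Minimum sample size to estimate a proportion (worst case p = 0.5)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # ~1.96 for 95%
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2    # infinite-population size (~385)
    n = n0 / (1 + (n0 - 1) / population)                  # finite-population correction
    return ceil(n)

print(sample_size(500))   # ~218, i.e. "more than 200 people"
```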

Margin of error

Max amount that the sample results are expected to differ from those of the actual population.
The closer it is to 0, the better the sample will match the entire population.

If a survey says 60% of the people asked chose something, but you have a 10% margin of error, the real value could be anywhere between 50% and 70%. Let’s say the confidence level is 95%: it means we’re “sure” (95 times out of 100) that the population value falls in that 50-70% range. But since the margin of error overlaps 50%, the result is inconclusive. The margin of error is calculated post-hoc.

Example on margin of error

For example, suppose you are conducting an A/B test to compare the effectiveness of two different email subject lines to entice people to open the email. You find that subject line A: “Special offer just for you” resulted in a 5% open rate compared to subject line B: “Don’t miss this opportunity” at 3%. 
Does that mean subject line A is better than subject line B? It depends on your margin of error. If the margin of error was 2%, then subject line A’s actual open rate or confidence interval is somewhere between 3% and 7%. Since the lower end of the interval overlaps with subject line B’s results at 3%, you can’t conclude that there is a statistically significant difference between subject line A and B. Examining the margin of error is important when making conclusions based on your test results.
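Here’s a minimal sketch of how that margin of error could be computed; the number of emails per variant is not given in the example, so the 450 below is my own assumption, chosen because it yields roughly the 2% margin of error mentioned above:

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p, n, confidence=0.95):
    """Margin of error for an observed proportion p from a sample of size n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)

n = 450                  # assumed emails sent per variant (not stated in the example)
p_a, p_b = 0.05, 0.03    # observed open rates for subject lines A and B

moe_a = margin_of_error(p_a, n)
print(f"A: {p_a:.0%} +/- {moe_a:.1%} -> CI [{p_a - moe_a:.1%}, {p_a + moe_a:.1%}]")
print(f"B: {p_b:.0%}")
# A's interval (~3%-7%) overlaps B's 3% open rate, so the difference is not conclusive.
```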

Dirty data

Data can be (and often will be) incomplete, incorrect or irrelevant. You will deal with data that is outdated, duplicated or missing. And that’s a big issue: dirty data comes with high costs (estimated here at 15-25% of companies’ total revenue).

Verification

After cleaning, compare the cleaned data with the initial data.
Focus on the big picture.
Manually check some cases.

Document everything!

Like in every good job, there are several reasons to document your activities (in a changelog):

  • recover from data-cleaning errors!
  • inform others of changes
  • determine the quality of the data

4. Analyze

You will want to think analytically about your data. At this stage, you might sort and format your data to make it easier to: 

  • Perform calculations
  • Combine data from multiple sources
  • Create tables with your results

Analysis basics in 4 steps:

  • Organize
  • Format and adjust
  • Get input from others
  • Transform data

Now that you have your data ready, this is the core activity you imagine when thinking of a data analyst (but you’ll be surprised to find that other activities, like preparation – data cleansing in particular – can be the most time-consuming and require more effort than this one). In this step, like in others, you’ll have plenty of tools to perform more or less the same identical operations: some are free (or even open source), others are paid (quite a lot); some are installed on your standalone machine, others run on-premise in your data center, and others are available in the cloud (or a mix of these cases). I will not write here about all these options, this post is already huge (sorry for that, but real knowledge can be shared neither in a 1-minute TikTok video nor in a tweet). In the Google specialization, the course on analysis was supported by code in R. To be honest, I prefer Python, but… yes, I used RStudio too (please don’t tell anybody, I’m not so proud). Just for the sake of science, I’ll even share here a screenshot – it’s about one of the most amazing datasets I ever saw, the Quartet by Francis John Anscombe (it’s a mind-blowing demonstration that completely different distributions can produce the same descriptive statistics – keep it in mind when someone in the media gives you only the mean of a complex phenomenon, in a world where everyone pretends to understand data):

RStudio: you can see the code in the upper-left window, then a terminal with the output below; data visualization (graphs representing the datasets) on the right. I used the online version here, but it’s also possible to install it locally.

Some of the reasons that explain the rise of R are its features: accessible, data-centric, open source, wide community, reproducible analyses, and the ability to manage lots of data and data visualizations. You can find these features in Python too, even if R is built mainly for statisticians (Python is an entire world, used by completely different categories of people, mostly developers). The course by Google explains everything from the basics: the IDE, data structures, manipulation/conversion of formats, packages (“libraries”), and it also covers the basics of data visualization (honestly, I prefer using specific tools for that, like Tableau, but I have to admit that R is improving a lot in dataviz). The course finishes with exports and reports in Markdown.
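The course shows Anscombe’s Quartet in R, but the point is easy to verify in any language; here’s a minimal Python sketch with two of the four sets hard-coded – very different shapes, nearly identical summary statistics:

```python
from statistics import mean, stdev, correlation   # correlation requires Python 3.10+

# Two of Anscombe's four sets: the same x values, very different y patterns.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("set I", y1), ("set II", y2)):
    print(f"{name}: mean_y={mean(y):.2f}, stdev_y={stdev(y):.2f}, "
          f"corr_xy={correlation(x, y):.3f}")
# Both sets print mean ~7.50, stdev ~2.03, correlation ~0.816 -- yet set II is a perfect parabola.
```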

There’s a lot to write about analysis, since it’s the core of analytics, but it really requires a solid understanding of statistics that I can’t fit into this already huge post. Once again: before starting to manipulate data and extract random metrics, even in the first phase of data exploration, make sure you have the basic concepts clear in mind. Otherwise, you’ll be just one of the many people fooled by numbers and fooled by people who know that, as Nobel prize winner Professor Ronald Coase said, “If you torture the data enough, nature will always confess“.

5. Share

Everyone shares their results differently, so be sure to summarize your results with clear and enticing visuals of your analysis, using tools like graphs or dashboards. This is your chance to show the stakeholders that you have solved their problem and how you got there. Sharing will certainly help your team:

  • Make better decisions
  • Make more informed decisions
  • Lead to stronger outcomes
  • Successfully communicate your findings

This is perhaps one of the most “satisfying” phases, partly because it also allows you to express your own artistic ambitions, and partly because it gives immediate visibility to your work, so you feel like a painter in front of your canvas.
This part requires not only specific technical abilities with computers, but also an understanding of graphics (line, shape, color, space, movement) and of (human) communication.

5 phases of design process

  • Empathize (think of audience needs)
  • Define (what your audience needs from the data)
  • Ideate
  • Prototype
  • Test

More than just a graph: dashboards and storytelling

Use a dashboard to collect multiple data visualizations in one place.
Remember that people explain with (and love) stories:

  1. Engage audience
  2. Create compelling visuals
  3. Tell the story with an interesting narrative:
  • Characters
  • Setting
  • Plot
  • Big reveal
  • Aha moment

Presentation

  • Give context, the source of the data, its meaning
  • If necessary, anticipate hypotheses
  • Show the possible impact of the solution
  • Don’t exaggerate with colors, but also try to make the viz fun (quizzes, questions, etc.)
  • Define your purpose
  • No distractions
  • Think of the target audience (e.g.: technical or not?)
  • Be concise
  • Follow a logical flow
  • Be easy to understand

Mistakes to avoid

  • No story or logical flow
  • No titles
  • Too much text
  • Hard to understand
  • Uneven and inconsistent format, no theme
  • No conclusion or recommendations slide

5 seconds

  • After showing a dataviz, wait 5 seconds
  • Ask if they understand it
  • Wait another 5 seconds

Think of possible questions

  • Data source?
  • Is the research reproducible?
  • Everything else your audience might be interested in after they see your data

6. Act

Now it’s time to act on your data. You will take everything you have learned from your data analysis and put it to use. This could mean providing your stakeholders with recommendations based on your findings so they can make data-driven decisions.
This is when you might think your job as a data analyst is done, but actually this is not like a “fire and forget” missile: you may want to guide the stakeholders and also to “capture the metrics” (or at least receive feedback), to adjust / double-check your analysis or to continuously improve your process. There are different cases, but remember to follow the data lifecycle – and no, the last 2 phases (archive and destroy) aren’t optional. You could also be responsible for the legal side of your data, following all the rules to be compliant with the European GDPR and many many others (ranging from medical to financial to military policies). And, as always, try to improve, or at least question, your way of thinking and your procedures (not to constantly live in doubt, but to live in awareness).

Final words

Whether you decide to follow this “professional certificate” from Google or to study from the many valid sources you can also find for free on the Internet (or from books at a decidedly low cost), I advise you to stop and think about the basics of the processes before you go fiddling with every possible button in the interfaces of thousands of software tools. If you only practice with a pencil without studying the basics of the human figure, you’ll be able to draw a wonderful line on your paper, but definitely not a good character: basics are paramount, I cannot stress this enough.
This is the final certificate, showing also the names of the single courses it consists of:

Like almost all the other courses I completed on Coursera (you can find the list here), this is a valid path for a beginner in the field. As an experienced data analyst myself, I can say there are also a lot of small tips I wish I had known when I started (learning by trial and error is great, but it comes with a cost… so better to learn from professionals whenever possible). And the capstone is a really great way to prove what you learned. For any questions, feel free to ask in the comments, I’ll be glad to answer your curiosities/doubts 🙂
