 Data to Decisions: 

 Data Quality 

Data availability, consistency, and accuracy all affect data quality. Learn how to collect and analyze data effectively by making sure you are working with good-quality data.

Data Quality

When gathering and using data, it is important to be aware of various issues that may affect the quality, and consequently the interpretation, of the data.

Some of the most common data quality issues are related to data availability, consistency, and accuracy.

It's important to conduct quality control checks on any gathered data and document any data issues you identify (and any actions taken to address the issues). This helps ensure that analysis and interpretation are as transparent as possible.

When examining a dataset for potential issues, some key aspects to watch for include:

  • big shifts in the data

  • outlier data points

  • breaks or gaps in the data

Data Availability

The degree to which data is readily available to users when and where they require it.

Data Consistency

The usability of applicable data. The process of keeping data uniform as it moves across, and between, various applications.

Data Accuracy

The degree to which data is error-free and can be used as a reliable source of information.

Shift in Data

The graph below shows a time series data set with an observable shift in data at the Aug-20 mark. Prior to this date, the data had been relatively stable for quite some time. After a steep increase between Aug-20 and Nov-20, the data levels out again at a much higher level.

[Figure: time series graph showing a shift in the data at the Aug-20 mark]

Such a drastic shift in data must be investigated. We need to ask questions to try to determine the cause of the shift, such as:

  • Did something happen in the housing market or economy to justify a legitimate increase in data?

  • Was there a change in the definition of the New Housing Price Index around Aug-20 that could have caused the shift in the data set?

Data Shift

A change in data distribution (for example, a change in classification or geography).
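As a rough illustration of how a shift like the one described above could be flagged for investigation, the sketch below compares the average of a few points before each position with the average of the points that follow. The function name, the window size, and the index values are all assumptions made for this example; they are not real New Housing Price Index data.

```python
from statistics import mean


def largest_level_shift(values, window=3):
    """Return (index, change) where the mean of the `window` points starting
    at `index` differs most from the mean of the `window` points before it."""
    best_index, best_change = None, 0.0
    for i in range(window, len(values) - window + 1):
        change = mean(values[i:i + window]) - mean(values[i - window:i])
        if abs(change) > abs(best_change):
            best_index, best_change = i, change
    return best_index, best_change


# Hypothetical monthly index values (illustrative only, not real data).
index_values = [100, 101, 100, 101, 100, 130, 131, 130, 131, 130]
shift_at, shift_size = largest_level_shift(index_values)
print(f"Largest shift starts at position {shift_at}: {shift_size:+.1f}")
```

A check like this only points to where to look. Whether the shift is legitimate or the result of a definition change still has to be answered from the data's documentation.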

Outlier Data

The graph below is an example of a data set with an outlier data point. All data points, except for NL, follow a similar pattern.

[Figure: graph of a data set in which the NL data point is an outlier]

The outlier data point should be researched to see whether it is the result of a mistake or accurately represents what it is measuring.

For example, we must ask ourselves:

  • Did something happen in NL at the time that resulted in significantly higher expenditures than the rest of the country?

  • Was the NL data entered incorrectly into the database, resulting in a data point much different from all the rest?

Outlier Data Point

A data point that differs significantly from other observations.
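One simple way to screen for outliers is to measure how far each value sits from the median, relative to the typical distance from the median (the median absolute deviation). The sketch below is one possible approach, not a prescribed method. The function name, the threshold `k`, and the expenditure values are assumptions made for this example.

```python
from statistics import median


def flag_outliers(values, k=5.0):
    """Return the names whose value lies more than k median absolute
    deviations (MAD) away from the median of all values."""
    centre = median(values.values())
    mad = median(abs(v - centre) for v in values.values())
    if mad == 0:
        # At least half the values equal the median; this simple rule cannot rank them.
        return []
    return [name for name, v in values.items() if abs(v - centre) > k * mad]


# Hypothetical expenditures by province (illustrative values only).
spending = {
    "NL": 95, "PE": 41, "NS": 43, "NB": 42, "QC": 45,
    "ON": 44, "MB": 42, "SK": 46, "AB": 47, "BC": 44,
}
print(flag_outliers(spending))
```

A flagged point is a prompt for research, not proof of an error. It may accurately represent what it is measuring.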

Data Gaps

The table below is an example of a data set with a data gap: the data for the year 2006 is missing.

[Table: example data set with the year 2006 missing]

The reason for this should be investigated:

  • Is it simply a mistake – was the data for that year accidentally omitted?

  • Is the data for that year unavailable? If so, why?

  • Was the survey not conducted for that year?

  • Did confidentiality practices require that year of data be suppressed?

Data Gap

Data for particular elements or data sets are knowingly or unknowingly missing.
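Gaps in a yearly series can be found by comparing the years that are present with the full range of years the series should cover. The following is a minimal sketch. The function name and the values are hypothetical, and it assumes the series is keyed by year with one value per year.

```python
def missing_years(series):
    """Return the years between the first and last year of `series`
    that are absent or have no value (None)."""
    years = sorted(series)
    expected = range(years[0], years[-1] + 1)
    return [y for y in expected if series.get(y) is None]


# Hypothetical yearly values with 2006 left out (illustrative only).
yearly_values = {2004: 398, 2005: 405, 2007: 419, 2008: 425}
print(missing_years(yearly_values))
```

Finding the gap is only the first step. The questions above about why the year is missing still need to be answered.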

If questions about the data cannot be answered, extreme caution should be exercised when using that data.

Data that is published without proper documentation, or that leaves critical questions unanswered, may indicate issues with the credibility or reliability of the source.

If you do not fully understand your input data, it is unlikely that you will accurately interpret your results.

Data Availability

Data availability can vary across time periods, geographic areas, and topics. For example, data from a survey introduced in the 1980s will not be available for earlier years because it was not collected before then. Additionally, the format in which data is available can change over time. Most recent data from Statistics Canada is accessible online in a usable format for download. However, older historical data, such as census data from before the 1970s, may only be available in scanned PDF form, which can make data selection and usage more challenging.

Suppression

If the geographic or organizational level of a dataset is small enough that individual survey responses could potentially be identified, the Statistics Act requires that the data be suppressed. This measure is in place to protect individual privacy and ensure that the collected data remains private, secure, and confidential.

As a result, there may be occasional gaps where data is unavailable due to suppression. For example, a dataset might be available for all municipalities in an area except one, to comply with confidentiality requirements.

Statistics Canada will note when data has been suppressed. You may see a superscript number next to a data point which corresponds to a note below the data set.

Time series gaps

Alternatively, years of data may be missing from a time series data set. In this case, if the data gap is relatively small, it may be possible to fill it in by making an assumption about what the data within the gap looks like.

Interpolation

The process of filling in a data gap based on assumptions is known as interpolation. The following examples demonstrate three different methods for filling in a missing year of data within a dataset, each based on a different assumption. It's important to note that this is a straightforward example with only a single year of missing data. The more gaps a dataset has, the more complex the process of attempting to fill them becomes.

[Figure: three methods for filling in a missing year of data]
Interpolation

The process of filling in a data gap using reasonable, justifiable, and well-documented assumptions.

Interpolation Assumptions
[Table: population by year, with the year 2007 missing]

The data for the year 2007 is missing. One way of estimating the missing year is to simply take the average of the two data points on either side of the missing year. The population for 2006 was 412, and the population for 2008 was 425, so one could assume that the population for 2007 was approximately 419.

(412 + 425) ÷ 2  =  418.5  ≈  419

The method used to fill in a data gap depends on the characteristics of the data within the data set:

  • The straight-line assumption works well with consistent data that follows a predictable pattern, making it reasonable to assume that the annual percentage change between the first and last year of data is relatively stable.

  • The average annual percentage approach may better reflect changes within a data set that is less consistent but still relatively predictable (i.e., no outliers or big shifts).

  • The two-point average approach may work best with more volatile data in order to capture year-to-year changes.
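The three approaches listed above can be sketched in a few lines of Python. Only the 2006 and 2008 populations (412 and 425) come from the example in the text; the other years are hypothetical values added so the methods have something to work with. The formulas are one plausible reading of each approach's name and may not match the exact calculations used in the original examples.

```python
def two_point_average(before, after):
    """Average of the values on either side of the missing year."""
    return (before + after) / 2


def straight_line(series, year):
    """Read the missing year off a straight line drawn between the
    first and last known years of the series."""
    known = sorted((y, v) for y, v in series.items() if v is not None)
    (y0, v0), (y1, v1) = known[0], known[-1]
    slope = (v1 - v0) / (y1 - y0)
    return v0 + slope * (year - y0)


def average_annual_pct_change(series, year):
    """Grow the previous year's value by the average of the known
    year-over-year percentage changes (assumes the previous year is known)."""
    years = sorted(series)
    rates = [
        series[b] / series[a] - 1
        for a, b in zip(years, years[1:])
        if b - a == 1 and series[a] is not None and series[b] is not None
    ]
    return series[year - 1] * (1 + sum(rates) / len(rates))


# 2006 and 2008 are from the example in the text; other years are hypothetical.
population = {2004: 400, 2005: 406, 2006: 412, 2007: None, 2008: 425, 2009: 431}

print("Two-point average:", two_point_average(population[2006], population[2008]))
print("Straight line:", round(straight_line(population, 2007), 1))
print("Average annual % change:", round(average_annual_pct_change(population, 2007), 1))
```

With stable data like this, the three estimates land close together. The choice of method matters more as the data becomes more volatile, and whichever assumption is used should be documented alongside the filled-in value.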

Data Accuracy

Data accuracy can become an issue when entering, copying, converting, or transferring data. When errors occur, the data may no longer be a reliable source of information, making quality control checks on any manipulated data essential. An example of a human error that could lead to data inaccuracy is a simple typo made while transferring information from a scanned PDF into a digital format, such as Excel.

Tip: Outlier data points can be an indicator of data accuracy issues.

Data Accuracy

The degree to which the data correctly describes what it was designed to measure.

Data Consistency

Consistency issues can arise when data is collected from different sources, jurisdictions, or time periods. To ensure that the data is comparable and can be used as part of a valid, consistent dataset, it's important to examine the definitions and descriptions of the datasets being collected. Even if datasets share the same name (e.g., "Net Debt"), they may be defined differently across geographic areas, jurisdictions, and sources.

Tip: Big shifts in data can be an indicator of data consistency issues.

Data Consistency

The usability of related data. The process of keeping data uniform as it moves across and between various applications.

A dataset collected from a single source can also undergo changes in definition over time. For example, population data for Town X might be available for several decades. However, if Town X amalgamated with surrounding communities, such as Town Y and Town Z, this would cause a shift in the data, as the population numbers are now combined into a single total. Additionally, this would result in a change in the definition of the dataset itself.

Data Appropriateness

Once you’ve done your quality control check on your data and you’re ready to use it, you also need to make sure you’re using it appropriately. One of the most common issues that arises from inappropriate use of data is called the Ecological Fallacy.

There are several spatial fallacies that can occur through the inappropriate use of data, but the Ecological Fallacy tends to be the most common.

When data at a specific geographic level isn’t available, it can be tempting to apply data from other geographic areas or levels, for example, applying provincial or national data at a local level. For the majority of localities, the picture painted by national data is quite different from the local reality. Using data in this way will almost certainly lead to a misleading interpretation. This is called the Ecological Fallacy.

Ecological Fallacy

The Ecological Fallacy occurs when conclusions about individuals or small areas are mistakenly drawn based on data from a larger group or geographic area. Assuming that what applies to a group also applies to each individual member can lead to misleading or inaccurate data analysis and interpretation. This is why it is crucial to use data at the appropriate geographic level.

For instance...

  • If a province has a declining population, it is incorrect to assume that every community within that province is experiencing a population decline.

  • If a community has a large, highly skilled population, it is incorrect to assume that every person in that community is highly skilled.

Example of Ecological Fallacy

From 2006 to 2016, the province of Newfoundland and Labrador experienced slow growth, with the total population increasing by 3% over the ten-year period.

The Avalon Peninsula, however, experienced 9% population growth over the same period, and within the Avalon Peninsula, the population of St. John’s grew by 14%.

Meanwhile, the population on the Northern Peninsula declined by 14%.

This example demonstrates how provincial data may not accurately reflect the experiences of smaller geographic areas within the province.
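To see how large the error can become, the sketch below applies the provincial growth rate to a single region and compares the result with the region's own rate. The percentage changes are the ones quoted above. The 2006 starting population of 10,000 is a hypothetical round number chosen only to make the arithmetic easy to follow, and the function name is likewise made up for this example.

```python
def apply_growth(population, pct_change):
    """Population after a percentage change (e.g. -14 means a 14% decline)."""
    return population * (100 + pct_change) / 100


# Percentage change in population, 2006 to 2016, as quoted in the text.
growth_pct = {
    "Newfoundland and Labrador": 3,
    "Avalon Peninsula": 9,
    "St. John's": 14,
    "Northern Peninsula": -14,
}

# Hypothetical 2006 population for a Northern Peninsula area (not real data).
start_population = 10_000

actual = apply_growth(start_population, growth_pct["Northern Peninsula"])
assumed = apply_growth(start_population, growth_pct["Newfoundland and Labrador"])
overestimate = assumed - actual

print(f"Using the regional rate:   {actual:,.0f}")
print(f"Using the provincial rate: {assumed:,.0f}")
print(f"Overestimate:              {overestimate:,.0f}")
```

In this illustration, applying the provincial figure to the region would suggest modest growth where the regional data shows a substantial decline.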

[Figure: population change from 2006 to 2016 in Newfoundland and Labrador and selected regions]