Check out the infographic below to learn about 5 common data caveats! Below the infographic you will find more information on data suppression, sample size issues, timeframe relevance, geographic relevance, and content application.
What is a Data Caveat?
A caveat is anything that has been or should be flagged about the dataset you’re using. This can include data limitations, notes, and context. Keeping caveats in mind will help guide your interpretation so that you have a more thorough understanding of what the data is telling you.
Being transparent about caveats -- limitations, context, and assumptions -- will strengthen the integrity of your work and results. You can also evaluate data sources based on how transparent they are about their data caveats!
Data Quality Caveats
A data quality caveat is a warning, note, or context that provides insight into the quality of a dataset. Data quality caveats are important for determining appropriate use cases for a dataset.
Want to learn more about data quality? Check out Data to Decisions: Data Quality.
Data Suppression
If the geographical or organizational level of a dataset is small enough that it's possible to identify individual survey responses, the Statistics Act requires that this data be suppressed. This is to protect individual privacy and to ensure data collected is private, secure, and confidential. It's also possible for data to be suppressed due to unsatisfactory data quality.
Statistics Canada, for example, will always note when data has been suppressed. You may see a superscript number that corresponds to a note below the dataset. This is an example of clearly communicating a data caveat.
Data suppression results in gaps in the dataset. Working with a dataset that has a high level of suppression can lead to a biased and incomplete understanding of what the data represents. If the data gaps due to suppression are relatively small, it may be possible to make assumptions about what the missing data is.
Always document your assumptions!
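One way to judge whether suppression gaps are "relatively small" is to measure how much of the dataset is suppressed before making any assumptions. The following is a minimal sketch in Python. The sample values are hypothetical, and the markers `x` and `F` are assumptions for illustration; check the notes published with your dataset for the symbols it actually uses.

```python
# Markers assumed to flag suppressed cells (an assumption for illustration;
# confirm the actual symbols in your dataset's notes).
SUPPRESSION_MARKERS = {"x", "F"}


def suppression_rate(values, markers=SUPPRESSION_MARKERS):
    """Return the fraction of cells in `values` that are suppressed."""
    values = list(values)
    if not values:
        raise ValueError("no values to check")
    suppressed = sum(1 for value in values if value in markers)
    return suppressed / len(values)


# Hypothetical column of counts read in as text, with some cells suppressed.
counts = ["120", "x", "85", "x", "F", "40", "310", "95"]

rate = suppression_rate(counts)
print(f"{rate:.1%} of cells are suppressed")
```

A high rate is itself a caveat worth recording alongside any assumptions you make about the missing values.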
Sample Size Issues
Sample size issues can arise when data has been collected from a small number of people and then used to represent a whole population. This can be problematic depending on the size of the sample and the size of the total population. Keep in mind that if the sample size is too small, there will be a high degree of uncertainty about whether the data accurately represents the population.
Be mindful of potential sample size issues when working with survey data -- data that was collected from a select number of people. Some data sources, such as Statistics Canada, follow rigorous methodologies to ensure the sample size results in good-quality data. Other data sources, such as industry organizations and governmental departments, also provide useful datasets. However, it is important to review individual sampling methodologies as standards vary. Noting sample size and data collection information, whether negative, neutral, or positive, is an example of documenting a caveat.
Sample size issues are not a concern when working with census data since the census collects data from the whole population.
Ask a data expert for help in determining the quality of a dataset!
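To get a rough sense of how sample size and population size interact, you can estimate a margin of error. The sketch below is a simplified illustration, not a substitute for a source's documented methodology: it assumes simple random sampling, a 95% confidence level (z = 1.96), and the most conservative proportion of 0.5. The function name and the example numbers are hypothetical.

```python
import math


def margin_of_error(sample_size, population_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion.

    Assumes simple random sampling and applies the finite population
    correction, so the result shrinks to zero when the whole
    population is surveyed (as in a census).
    """
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    if population_size > 1:
        correction = math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    else:
        correction = 0.0
    return z * standard_error * correction


# Hypothetical population of 10,000 people.
small_sample = margin_of_error(50, 10_000)
large_sample = margin_of_error(1_000, 10_000)
full_count = margin_of_error(10_000, 10_000)

print(f"n=50:     +/- {small_sample:.1%}")
print(f"n=1000:   +/- {large_sample:.1%}")
print(f"n=10000:  +/- {full_count:.1%}")
```

The smaller sample carries a much wider margin of error, while surveying everyone leaves no sampling uncertainty at all. Real surveys often use more complex designs, so treat this as a first check only.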
Data Relevancy Caveats
Data relevancy caveats help you determine how appropriate a dataset is to use to answer your research question or solve your problem. Data relevancy caveats associated with your own work are also important to communicate.
Timeframe Relevance
Datasets are always associated with a timeframe that the data represents. It is important that the dataset represents a timeframe that aligns with your work. For example, if you're looking to understand trends over the last year, it would be most relevant to use data that represents activity that occurred in the last year. Sometimes, however, data availability is limited for the year in question. It can be appropriate to use an older dataset if it is the best option available.
Check that the dataset you're working with is the closest available to the time frame you are studying!
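If you have several releases of a dataset to choose from, the check above can be sketched in a few lines of Python. The years below are hypothetical, and preferring the more recent release when two are equally close is an assumption of this sketch, not a rule from any data provider.

```python
def closest_reference_year(available_years, target_year):
    """Return the available year closest to the year being studied.

    When two years are equally close, the more recent one is preferred
    (an assumption made for this sketch).
    """
    if not available_years:
        raise ValueError("no datasets available")
    return min(
        available_years,
        key=lambda year: (abs(year - target_year), -year),
    )


# Hypothetical release years for a dataset.
releases = [2011, 2016, 2021]

best = closest_reference_year(releases, target_year=2023)
gap = abs(2023 - best)
print(f"Closest release: {best} ({gap} years from the target year)")
```

Recording the gap between the release year and the year you are studying is a simple way to document the timeframe caveat.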
Geographic Relevance
All population data is associated with a specific geographic area. It is important that the geographic area associated with a dataset is relevant for your analysis. As the geographic hierarchy highlights, data is available for Canada, geographic regions of Canada, provinces, municipalities, and more. It is important that the level of geographic specificity in your research problem matches the level of geographic specificity of the dataset. Often, data for a large geographic area, such as a province, doesn't capture the differences between municipalities within that province.
Carefully review the geographical areas associated with your datasets and document any assumptions you've made!
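A small worked example can show how a figure for a large area hides local differences. The municipalities, populations, and rates below are entirely hypothetical and exist only to illustrate the arithmetic.

```python
# Hypothetical municipalities: name -> (population, rate of interest).
municipalities = {
    "Town A": (2_000, 0.12),
    "Town B": (8_000, 0.04),
    "City C": (90_000, 0.06),
}


def weighted_rate(areas):
    """Combine local rates into one rate, weighted by population."""
    total_population = sum(population for population, _ in areas.values())
    weighted_sum = sum(population * rate for population, rate in areas.values())
    return weighted_sum / total_population


combined = weighted_rate(municipalities)
local_rates = [rate for _, rate in municipalities.values()]
spread = max(local_rates) - min(local_rates)

print(f"Combined rate: {combined:.2%}")
print(f"Gap between highest and lowest local rate: {spread:.2%}")
```

The combined figure sits close to the largest city's rate, so using it to describe Town A would be misleading. That mismatch is the kind of assumption worth documenting.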
Dataset Definitions
Datasets usually come with a set of definitions that explain what each variable means within the dataset. It is important to ensure that the data is measuring what you believe it is measuring. Within a dataset, a variable is often given a short name as a heading, which may not capture the full extent of what it's measuring. Also, when using data from multiple data sources, it is possible that different definitions of the same variable are used. All information regarding definitions and variations in definitions should be documented.
Review the variable definitions that correspond to your datasets to ensure compatibility with each other and your work!
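When combining sources, a quick side-by-side comparison of their data dictionaries can surface variables that share a name but not a meaning. The sketch below uses two hypothetical data dictionaries and a hypothetical helper, `compare_definitions`. It only flags differences in wording, so a person still needs to judge whether the definitions are truly compatible.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def compare_definitions(source_a, source_b):
    """Compare two data dictionaries (variable name -> definition).

    Returns the shared variables whose definitions differ, plus the
    variables that appear in only one source.
    """
    shared = sorted(source_a.keys() & source_b.keys())
    conflicts = {
        name: (source_a[name], source_b[name])
        for name in shared
        if normalize(source_a[name]) != normalize(source_b[name])
    }
    only_a = sorted(source_a.keys() - source_b.keys())
    only_b = sorted(source_b.keys() - source_a.keys())
    return conflicts, only_a, only_b


# Hypothetical data dictionaries from two sources.
survey_a = {
    "income": "Total household income before tax",
    "age": "Age in years at last birthday",
}
survey_b = {
    "income": "Individual employment income after tax",
    "age": "age in years at last  birthday",
    "region": "Province of residence",
}

conflicts, only_a, only_b = compare_definitions(survey_a, survey_b)
for name, (definition_a, definition_b) in conflicts.items():
    print(f"{name}: '{definition_a}' vs '{definition_b}'")
```

Here both sources have an `income` column, but one measures households before tax and the other measures individuals after tax, which is a definition caveat to document before comparing them.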
This article has been sponsored in part by the Future Skills Centre.