Depending on who you talk to, cleaning market research data files for use with advanced statistical analyses takes between 50% and 95% of the total time spent working on a dataset. It’s the only way to ensure that the results you see and the conclusions you draw reflect the treatment and conditions, rather than some tiny error dismissed long ago as inconsequential.
Hopefully, the tactics that follow will help you ensure the advanced statistical analyses you run are based on the best quality dataset possible.
First, and more important than perhaps anything else, is to save the first dataset you receive in a separate folder and then never touch it. This is your safety net for that one day down the road when you accidentally overwrite the entire file you’re working from. That day will come, and you need to build habits to ensure that day doesn’t become a nightmare. Never touch that separate copy.
Second, make it a habit to regularly save additional versions of the cleaned datafile. Minimally, save a new file once a day. Even better, save a new file every time you make a major revision to the dataset. Never simply click ‘Save.’ Use ‘Save As’ and give the file a new name, preferably with a sequential number so that all the files remain in order. As before, this will come in handy on that one day you realize you implemented a correction…incorrectly, and need to return to the previous version without losing six days of work.
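If you work with scripts rather than a point-and-click package, the ‘Save As’ habit can even be automated. Here is a minimal Python sketch, under the assumption that your working file sits on disk and that names like survey_v001.csv are acceptable; the file names are purely illustrative:

```python
import shutil
from pathlib import Path

def next_version_name(working_file, existing):
    """Pick the next free numbered filename, e.g. survey_v001.csv, survey_v002.csv."""
    src = Path(working_file)
    n = 1
    while True:
        candidate = f"{src.stem}_v{n:03d}{src.suffix}"
        if candidate not in existing:
            return candidate
        n += 1

def save_as_new_version(working_file):
    """'Save As' the working file under the next free version number, never overwriting."""
    src = Path(working_file)
    taken = {p.name for p in src.parent.glob(f"{src.stem}_v*{src.suffix}")}
    dst = src.with_name(next_version_name(working_file, taken))
    shutil.copy2(src, dst)  # copies the data and the timestamps
    return dst
```

Because the version number is zero-padded, the files also sort correctly in any file browser.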
Begin your formal data checking process by running a frequency distribution of every variable.
- Look to see that the numbers of men and women match your expectations, that the distributions of age, income, education, ethnicity, and children make sense, e.g., you know that far fewer than 50% of people have college degrees and far more than 10% of people have high school diplomas.
- Specifically check that numbers that ought to be very small or very large are actually very small or very large. Make sure most people recognize Tide, Pepsi, and Oprah and that very few people recognize the Acco brand of paper clips.
- Look for answer options that were selected by no one. Should these be zero or did variables in the datafile unknowingly shift over by one column?
- Follow through the logic of the skip patterns. If people answered questions in a certain way, do their follow-up answers match those responses? Or did the rows or columns shift here too?
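The frequency-distribution step above can be sketched in a few lines of Python. This is a minimal illustration with made-up variables and response codes; in practice you would read the rows from your exported file with csv.DictReader:

```python
from collections import Counter

def frequency_tables(rows):
    """Tally every response option for every variable, including blanks."""
    tables = {}
    for row in rows:
        for var, value in row.items():
            tables.setdefault(var, Counter())[value] += 1
    return tables

# Hypothetical mini-dataset; field names and codes are illustrative only.
rows = [
    {"gender": "1", "educ": "2"},
    {"gender": "2", "educ": "2"},
    {"gender": "1", "educ": ""},  # a blank that should not be silently dropped
]
tables = frequency_tables(rows)
```

Note that options no one selected simply won’t appear in the tally, so compare each table against the codebook as well: a missing ‘3’ might be a real zero, or it might be a column shift.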
Check variable labeling and coding.
- Make sure numeric values are coded as numeric variables, not string variables, and vice versa. This is what will determine the order that answer options appear in your outputs. It will also determine which statistical tests will be ‘turned on’ by your statistical software.
- Check that missing responses aren’t recoded into zeros, thereby making them appear to be valid responses.
- Check that ‘Don’t know,’ ‘All of the above,’ and other non-substantive responses are correctly labeled and immediately identifiable. If not, it’s possible that those 9s and 99s will be improperly included in t-tests or correlations, or treated as valid responses for calculating means and standard deviations.
- Check that every single variable and response option is correctly coded. For example, make sure that the label for ‘Male’ actually matches with responses from men. This simple mistake has already caused numerous published academic articles to be retracted and knowledge within a discipline to change. Don’t add to that list!
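One way to guard against the missing-value problems above is to recode sentinel codes such as 9 and 99 to missing (never to zero) as the values come in. Here is a minimal Python sketch; the codebook and variable names are purely hypothetical:

```python
def clean_value(var, raw, missing_codes):
    """Convert a raw string to a number, mapping sentinel codes to missing (None)."""
    if raw == "" or raw in missing_codes.get(var, set()):
        return None  # never 0: a zero would look like a valid response
    return float(raw)

# Hypothetical codebook: which codes mean 'Don't know' / 'Refused' per variable.
MISSING = {"satisfaction": {"9"}, "income": {"99", "999"}}

cleaned = [clean_value("satisfaction", v, MISSING) for v in ["4", "9", "2", ""]]
# cleaned is [4.0, None, 2.0, None]
```

Because the sentinels become true missing values rather than zeros, they can no longer sneak into means, standard deviations, t-tests, or correlations.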
Remove untrustworthy data.
- And of course, make sure to apply standard data quality processes to ensure that low quality data is not part of the final dataset. It’s always possible that research participants could become bored or distracted partway through data collection, necessitating the removal of some or all of their data. Don’t leave low quality data in the file simply because you need the sample size. Data errors are most dangerous when they appear in the smallest sample sizes.
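Two of the standard quality screens, straightlining through a grid of items and implausibly fast completion, can be sketched in a few lines of Python. The field names and the threshold are illustrative assumptions, not an industry standard:

```python
def flag_low_quality(resp, grid_items, min_seconds):
    """Return the reasons a respondent looks untrustworthy (empty list = keep)."""
    reasons = []
    answers = [resp[item] for item in grid_items]
    if len(set(answers)) == 1:  # identical answers across a whole grid
        reasons.append("straightliner")
    if float(resp["duration_sec"]) < min_seconds:  # finished implausibly fast
        reasons.append("speeder")
    return reasons

# Hypothetical respondent records; field names are illustrative only.
grid = ["q1", "q2", "q3", "q4"]
r1 = {"q1": "3", "q2": "3", "q3": "3", "q4": "3", "duration_sec": "95"}
r2 = {"q1": "1", "q2": "4", "q3": "2", "q4": "5", "duration_sec": "610"}
```

Flagging with reasons, rather than deleting outright, lets you review borderline cases before any respondent is removed.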
Besides ensuring that your data is top quality, running all of these checks also serves as exploratory analysis, a key component of better understanding the basic findings that allow you to generate hypotheses to test with more advanced statistics.
The next time you need to run advanced statistical analyses, make sure you leave enough time for data cleaning and exploratory analysis. You’ll be grateful you did!
With nearly 40 years of experience, Canadian Viewpoint is a field and data collection company that specializes in English and French offline and online services. We offer consumer and medical sample, programming and hosting, custom omnibus, mall intercepts, pre-recruits to central location, mystery shopping, site interviews, IHUTs, sensory, product, and package tests, discussion boards, CATI, facial coding, and other innovative technologies. Learn more about our services on our website. Canadian Viewpoint is a founding board member of CRIC (Canadian Research Insights Council) and named on both the 2019 GRIT Top 50 list of Emerging Players and the Women in Research shortlist for Best Places to Work.