2011 NHS: Closing notes on data quality

Statistics Canada has made a point of (mis)informing Canadians about the reliability and usability of the 2011 National Household Survey (NHS) data. While Canadians were told that the decline in response rates relative to the prior long-form Census resulted in greater data suppression at virtually every sub-provincial level, most of the data was still published and represented as being of sufficiently high quality, and Canadians were encouraged to use it. The media for the most part followed this advice, reporting on the first couple of releases with few if any concerns. Chief Statistician Wayne Smith even made a point of criticising those who cautioned Canadians about the reliability of the published data, stating that critics were doing a “disservice to Canadians”. The following discusses the significantly lowered data quality standards Statscan adopted for the 2011 NHS in order to (mis)represent its data as fit for publication.

Doubling the acceptable non-response threshold

What has largely passed unreported is that Statscan significantly changed the standard of what it deemed an acceptable “global non-response” rate for the 2011 NHS. Contrast the 2006 Census Data Quality and Confidentiality Standards and Guidelines (identical to those for Censuses prior to 2006, as well as for the 2011 Census) with the 2011 NHS Data Quality and Confidentiality Standards and Guidelines. Before the 2011 NHS, geographic areas with a global non-response rate greater than 5% were flagged as less reliable, and any with a rate of 25% or greater were suppressed. For the 2011 NHS, the acceptable non-response rate was doubled to 50%, and no data below that threshold was flagged as less reliable.

One way to demonstrate the dramatic difference this change in data quality standards has had is to look at the published 2011 NHS data for Census subdivisions (CSDs). Of the 5252 CSDs in the 2011 NHS, data for 3439 (65%) was deemed fit for release under the post-2011 NHS standard, while data for 1813 was suppressed, almost all for non-response. Under the pre-2011 NHS standard, however, only 994 of those 5252 CSDs, or 19%, would have been fit for release, while 4258 would have been suppressed. By comparison, of the 5418 CSDs in the 2006 long-form Census, data for 4534, or 84%, was fit for release, while data for only 884 was suppressed.
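The two standards and the release shares above can be sketched as a simple classification rule and a bit of arithmetic. This is purely illustrative: the function and its parameters are my own shorthand for the published thresholds, not Statscan's actual methodology.

```python
def release_status(non_response, flag_at, suppress_at):
    """Classify an area's data under a given standard (illustrative sketch).

    Pre-2011 NHS standard:  flag_at=5,  suppress_at=25
    Post-2011 NHS standard: flag_at=None, suppress_at=50 (no reliability flag)
    """
    if non_response >= suppress_at:
        return "suppressed"
    if flag_at is not None and non_response > flag_at:
        return "released (flagged less reliable)"
    return "released"

# An area with 30% global non-response: suppressed pre-2011, released post-2011
print(release_status(30, 5, 25))     # suppressed
print(release_status(30, None, 50))  # released

# Release shares for CSDs, using the counts quoted above
print(f"2011 NHS, post-2011 standard: {3439 / 5252:.0%} released")
print(f"2011 NHS, pre-2011 standard:  {994 / 5252:.0%} released")
print(f"2006 long-form Census:        {4534 / 5418:.0%} released")
```

The same 30% non-response area flips from suppressed to released simply by swapping standards, which is the whole of the change discussed here.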

The only way Statscan was able to publish the 2011 NHS data was by dramatically dropping its quality standard: doubling the acceptable global non-response rate.

Accepting incomplete questionnaires

What exactly did Statscan consider an acceptable response to the 2011 NHS questionnaire? Apparently, enumerators had been instructed to accept the long form with as few as 10 of 84 questions answered. What the completeness standard was for the 2011 NHS, and how it differed from that of previous long-form Censuses, are important questions that, to date, have not been answered. Given that Statscan lowered the data quality standard by doubling the acceptable non-response rate, one suspects a similarly dramatic change to the acceptable questionnaire completeness rate for the 2011 NHS relative to prior long-form Censuses.

Artificially boosting response rates to boost public confidence

How did Statscan boost the response rate from the expected 50% to 69% (unweighted; 77% weighted)? The initial estimate was based on a test run of the 2011 long-form Census conducted in 2008. Because it was a test, the 2008 survey was voluntary, as the 2011 NHS that replaced the long-form Census ended up being. The response rate for the 2011 long-form Census test was 46% (unweighted; 45% weighted). How Statscan managed to significantly raise the response rate between the test and the actual survey is another question that, to date, has not been answered, although one suspects a lowered threshold for questionnaire completeness had something to do with it.

Hiding item non-response

Item non-response is useful in that, in combination with the survey sample size and complete non-response (incomplete questionnaires), it can give an idea of how many responses a specific statistic is based on. Given the suspect 2011 NHS data on immigrant and Aboriginal populations, a number of social advocacy groups have inquired about item non-response for the immigrant and Aboriginal/First Nations questions. They have received responses from Statscan stating that “Single response rate by question are not available,” and been redirected to the imputation rates. The ‘it’s not available’ line doesn’t explain why item non-response was available for the 2011 long-form Census test in 2008, as well as for the 2006 long-form Census, including the Immigration and Aboriginal releases. Presumably, Statscan has chosen not to make it available for the 2011 NHS. Even if it were available, given reports that practically incomplete questionnaires were accepted to boost sample size and lower published non-response, the item non-response data would not be all that meaningful.

If Statscan’s goal was to reassure Canadians about 2011 NHS data quality, its lack of transparency regarding the significant drop in quality standards, combined with the misinformation provided in response to data users’ requests, certainly didn’t help.
