Following Statistics Canada’s July 2014 Labour Force Survey (LFS) flub, speculation about the survey’s reliability abounded. The agency subsequently issued a statement to assure the public it was a one-off ‘human’, as opposed to systemic, error. It then immediately followed up with an August LFS release that raised a collective eyebrow. The interesting part about StatsCan’s explanations for the wonky LFS stats of late is the lack of data quality measures to back them up.
According to StatsCan, public confidence in its data quality is of paramount importance. To this end, it maintains a Policy on Informing Users of Data Quality and Methodology. Unfortunately, the policy diverges widely from current practice.
The only explanation the Review of the July 2014 LFS release provides for the nearly 42,000 job revision is “a higher number of individuals who reported being employed in June, and who did not respond in July, being imputed as either out of the labour force or unemployed in July.”
For starters, StatsCan defines imputation as “the process used to assign replacement values for missing, invalid or inconsistent data that have failed edits.” Basically, that means computing guesstimates, either for individual questionnaire items or for whole units (entire sampled LFS households or members thereof).
The short (6-page) review includes 25 instances / variations of ‘impute’. It also includes 54 instances / variations of ‘system’, the agency’s intended focus. Not surprisingly, media coverage of the release focused on the latter – but not necessarily for lack of trying to obtain info on the former, as it turns out.
Glaringly absent from the July 2014 LFS review was the imputation rate for employment status. Other than the review, few LFS docs even mention imputation. The Methodology of the Canadian LFS mentions it just once, to advise readers that an imputation rate is “also a quality indicator with regard to data processing.”
The LFS methodology guide also notes that “the maximum non-response is usually attained in July.” While the guide doesn’t provide the July rate, it gives a national average non-response rate of 4.9% for 2005. More recent estimates from StatsCan peg non-response at 10-12%. While the guide doesn’t provide imputation rates, it refers to LFS data quality reports that would have them.
Since its policy (Policy 1) indicates, “Statistics Canada will make available to users indicators of the quality of data it disseminates,” the solution seemed straightforward: Just request the LFS quality reports. It wouldn’t suffice to just say StatsCan refused to provide the information.
[At first, StatsCan dismissed the request, stating the information in the methodology guide is all the agency had available. A response specifying the requested data quality reports went unanswered.
A subsequent request for the data quality reports was bizarrely dismissed with a note indicating the LFS methodology guide was for internal use only and could not be released due to confidentiality. The request wasn’t for the methodology guide – which is available online.
A note the following day asked for confirmation that the requested data quality reports had been received. A response indicating no reports had been received went unanswered.]
Given the LFS sample size, where each individual respondent represents about 200 people (age 15+), the difference between – not sum of – the employed and unemployed respondents who had their employment status imputed in July was just over 200, or 0.2% of individuals in the sample. That slim a share is the difference between zero and 40,000 jobs in any given month.
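The arithmetic behind that claim can be sketched in a few lines. The inputs are the rounded figures quoted above (roughly 200 people per respondent, a net difference of roughly 200 respondents). The sample size is not a published figure here; it is simply backed out from the 0.2% share, so treat it as an assumption.

```python
# Rough check of the figures quoted above; all inputs are rounded.
PEOPLE_PER_RESPONDENT = 200    # approximate survey weight (age 15+)
NET_IMPUTED_RESPONDENTS = 200  # net difference in imputed status, "just over 200"
SHARE_OF_SAMPLE = 0.002        # 0.2% of individuals in the sample


def weighted_jobs(net_respondents: int, weight: int) -> int:
    """Scale a net respondent count up to a population-level estimate."""
    return net_respondents * weight


def implied_sample_size(net_respondents: int, share: float) -> int:
    """Back out the sample size implied by a count and its share.

    This is an inference from the rounded numbers, not a published figure.
    """
    return round(net_respondents / share)


jobs = weighted_jobs(NET_IMPUTED_RESPONDENTS, PEOPLE_PER_RESPONDENT)
sample = implied_sample_size(NET_IMPUTED_RESPONDENTS, SHARE_OF_SAMPLE)

print(f"Net effect on reported employment: about {jobs:,} jobs")
print(f"Implied sample size: about {sample:,} individuals")
```

In other words, a couple of hundred imputed records are enough to account for a revision of the size StatsCan reported.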
[Despite this, the standard error for employment was only 28,500 for July, exactly the same as in August. While one sixth of the LFS sample changes and the response rate fluctuates monthly, StatsCan doesn’t publish a new CV for each release, but rather a twelve-month average updated bi-annually.]
The more interesting question is the sum total of LFS imputations, which StatsCan has gone out of its way not to disclose. A bit of info from the US Census Bureau, which is far more transparent than StatsCan when it comes to data quality, may be helpful here.
The Census Bureau’s Current Population Survey (CPS), source for the monthly US household labour force report, provides a chart showing recent monthly non-response rates of 10-11%, similar to the LFS. It also shows the rate has increased post-recession, as it has with the LFS. Notably, the Census Bureau indicates its research shows refusals are more likely to be associated with unemployment.
Unlike the LFS, the CPS does not experience spikes in non-response during the summer. Also notable: Unlike the LFS, the CPS compiles data quality measures like the coverage rates by race, which declined considerably for Black and Hispanic Americans in recent years.
While it doesn’t publish them for the CPS, the Census Bureau does publish imputation (also called allocation) rates for the American Community Survey (ACS). The ACS imputation rates for individual respondents have doubled since the recession; for 2013, 8.1% of employment and 19% of wage/salary responses were imputed.
Putting the pieces together: If those who refuse to respond are more likely to be unemployed, and the imputation process involves either copying previous responses over or using data from respondents more likely to be employed, then imputed data will tend to overestimate employment.
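To make the direction of that bias concrete, here is a toy calculation. Every number in it is hypothetical and chosen only for illustration: a sample of 1,000, a 10% non-response rate, respondents who are 60% employed and non-respondents who are in truth only 50% employed. It is not StatsCan’s imputation method, just a sketch of what happens when missing records are filled in using the respondents’ employment rate.

```python
# Toy illustration with made-up numbers; not actual LFS data or methodology.
RESPONDENTS = 900
EMPLOYED_RESPONDENTS = 540            # 60% of respondents (hypothetical)
NONRESPONDENTS = 100
TRULY_EMPLOYED_NONRESPONDENTS = 50    # 50% of non-respondents (hypothetical)


def impute_from_respondents(nonrespondents: int,
                            employed_respondents: int,
                            respondents: int) -> int:
    """Assign non-respondents the same employment rate as respondents.

    Integer arithmetic keeps the toy example exact.
    """
    return nonrespondents * employed_respondents // respondents


def employment_rate(employed: int, sample_size: int) -> float:
    """Share of the sample counted as employed."""
    return employed / sample_size


sample_size = RESPONDENTS + NONRESPONDENTS
imputed_employed = impute_from_respondents(
    NONRESPONDENTS, EMPLOYED_RESPONDENTS, RESPONDENTS
)

true_rate = employment_rate(
    EMPLOYED_RESPONDENTS + TRULY_EMPLOYED_NONRESPONDENTS, sample_size
)
reported_rate = employment_rate(
    EMPLOYED_RESPONDENTS + imputed_employed, sample_size
)

print(f"True employment rate:     {true_rate:.1%}")
print(f"Reported after imputing:  {reported_rate:.1%}")
print(f"Overstatement:            {reported_rate - true_rate:.1%}")
```

With these made-up inputs the reported rate comes out one percentage point too high. The gap widens as either the non-response rate or the difference between the two groups grows.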
Even more so if the imputation rate has increased over time, which the increased LFS non-response and the Census Bureau anecdote suggest it has. Since StatsCan can’t factor for race in its LFS imputations (because it doesn’t ask about it in the survey), and since other things being equal racial minorities have poorer labour market outcomes, its imputation process likely overestimates employment more than the US Census Bureau’s process.
If there were fewer households or members available to respond to the July 2014 LFS, necessitating greater imputation, then it would be far from clear whether the revised employment figure was any more reliable than the one originally reported, keeping in mind how small a number of poorly imputed responses it takes to change the reported outcome.
The curious August 2014 LFS was likely also affected by imputation. In addition to many respondents still being away on summer vacation, educational institutions resumed activities. Non-respondents who’d previously indicated working in education and who ended up incorrectly imputed as resuming those jobs, along with newly hired educational workers (some replacing departing or retiring ones), would tend to overstate educational service employment. Given the factors in play, efforts at adjusting for the effect could make things worse.
Of course, that’s all speculation based on random data points and anecdotes, since StatsCan refuses to release LFS data quality measures. As it also disregards its own data quality disclosure policy in the process, here too one can only speculate as to why.
If LFS data quality has declined in recent years due to increased non-response, the agency would need to justify why/how that’s happened, given the supposedly mandatory nature of the survey. Whether it cracks down on refuseniks or decides to change it to a voluntary survey, it risks further jeopardising survey reliability (see 2011 long-form Census fiasco). [While administrative data and business surveys can provide better estimates of payroll employment and rough estimates of wages, they can’t provide insight into household labour market attachment – the very reason for the LFS.]
StatsCan should reconsider its refusal to publish LFS data quality measures. If it leads to tough questions, so be it. Pretending to be transparent only contributes to further eroding public confidence, more so when mistakes like the July 2014 LFS flub occur. Americans are quite aware of the limitations of their economic indicators, thanks to their statistical agencies’ transparency. US data users can factor for it in their analysis, and work on improving the metrics can move forward. Canadians deserve the same courtesy.
– Andrew Baldwin adds:
What are the views of the members of the Federal-Provincial Territorial Committee on Labour Statistics on this latest LFS error or has it not met to discuss it yet? For that matter, who are the current members of that Committee now? This kind of information, that should be readily available on the StatCan website, isn’t provided there. In fact the standard report regarding minutes of any of the many Federal-Provincial Territorial Committees for official statistics is this: “The meeting minutes have been provided to the committee members for distribution within their jurisdiction.” StatCan could not be more opaque or less transparent. If anyone on any of these Committees criticizes their methods or procedures, they don’t want the public to know about it.
– The US Census Bureau was kind enough to provide a prompt, courteous and complete response to our inquiry. Courtesy of Steven (last name withheld):
To follow up on our recent conversation, we do not provide allocation rates. However, you can calculate them. If the second letter of a variable name is “X” then that variable is allocated. I will provide an example below to illustrate how to calculate the allocation rate.
Say there are 10,000 instances of HETENURE and there are 2,000 instances of HXTENURE.
The allocation rate would be 2,000/10,000 or .2.
This is how the process was described to me. A detail or two may have been lost in translation so if you have issues please feel free to contact us. I will be out of the office for the rest of the week, you can call our branch line and someone will be happy to assist you.
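For anyone who wants to try this, Steven’s example can be sketched as follows. The counts are the ones from his note; the function names are hypothetical helpers, and the “second letter is X” check is taken directly from his description rather than from any CPS documentation, so verify it against the data dictionary before relying on it.

```python
def is_allocation_flag(variable_name: str) -> bool:
    """Return True if the second letter is 'X', per the description above."""
    return len(variable_name) >= 2 and variable_name[1].upper() == "X"


def allocation_rate(allocated_count: int, total_count: int) -> float:
    """Share of responses for a variable that were allocated (imputed)."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return allocated_count / total_count


# Counts from the example in the note above.
total_hetenure = 10_000      # instances of HETENURE
allocated_hxtenure = 2_000   # instances of HXTENURE

rate = allocation_rate(allocated_hxtenure, total_hetenure)

print(is_allocation_flag("HXTENURE"))  # True
print(is_allocation_flag("HETENURE"))  # False
print(f"Allocation rate: {rate:.0%}")
```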