Data Reliability and Validity, Redux: Do Your CIO and Data Curators Really Understand the Concepts?

Here are two recent entries on the big but neglected issue of data reliability and analytic validity (DR&AV), from the vast commentariat that is LinkedIn:

One of my complaints with #bigdata, is there isn’t enough focus on getting the right kind of data. We deal with this in healthcare all the time. Much of our transactional data is just outcomes. There’s a push in the industry to change health behaviors (generally: population health). But if we’re not collecting meaningful behavioral data (could be secondary data like credit card purchases or primary like surveys about health attitudes), we can’t determine what behaviors are driving the outcomes! — Biostatistical Data Scientist, LinkedIn commenter, June 2016.

A potential problem is that people know less and less as to how to conduct surveys well. Conducting a survey is easier than ever, but the same technologies that make surveys easier are also making response bias easier to creep into the results as well. I suspect that we are headed to a disaster of Literary Digest proportions, for many of the same reasons. Of course, the data we have is very huge. But, at least for the problem that we want to analyze, the data is all wrong. Yet, there seems to be a big resistance to cleverly trying to address these problems instead of worshipping blindly at the altar of technology. —Sociological Data Scientist, LinkedIn commenter, September 2017.

Note that neither of these commenters mentions the words reliability or validity. But that’s what they’re talking about.

In the first, the biostatistical data scientist raises a basic question about data validity, i.e., the absence of “meaningful behavioral data” relevant to answering questions about which behaviors are “driving the outcomes.” Apparently, her organization has a lot of data, but not the right kind for its purposes. In effect, she is saying that all their non-behavioral data is invalid because it does not measure what they need to know. From that standpoint, she might as well not have any data at all.

This is a teachable moment for those who insist that Big Data solves all problems. Her example shows you can have all the data you can possibly hoover up into a warehouse or a lake, or enough to fill the Mariana Trench, and have it all be invalid. Invalid, that is, for the purposes of the one or more teams of analysts and data scientists looking for answers to questions posed a priori, before collecting data and commencing analytics. These were specific questions they wanted to ask of the data, in terms of variables to investigate, that apparently were not communicated to those responsible for data collection, selection, or construction. (Another reason why Statistical Data Scientists, along with the CS-IT Data Scientists in their Data Manager role, should take the lead in specifying data needs or requirements for analytics before data collection. But that’s another blog post.)

The second comment addresses both issues of data reliability and validity stemming generally from watered-down skill sets and lowered levels of theoretical (mathematical and logical) understanding of survey research, among survey researchers themselves. He states that it is much easier to conduct a survey now than ever before, in terms of a one-day, one-question pop-up. But to conduct a survey properly, so that data is not rendered unreliable, and findings invalidated by response or other types of bias, is and has remained a painstaking process. (See “Bias in Survey Sampling.”)

I’ll put it this way: my inspection of SurveyMonkey did not show a capacity for tests of reliability and validity on the data collected, nor item analysis, nor other diagnostics that exist for survey data. (Maybe the capacity is there, in which case I stand corrected.) And the second commenter raises the example of the ultimate triumph of bad (unreliable) data from survey research, the Literary Digest poll fiasco of 1936, and suggests that because we are not paying attention to the basics of DR&AV, we are heading in the same direction now.

This leads me to think in terms of the reliability of a measuring instrument, e.g., a questionnaire (survey instrument) administered to gauge employees’ job satisfaction. This is an evergreen example. But some statistical data scientists (statisticians, and their more applied sisters and brothers in economics and the social sciences) themselves do not necessarily appreciate that reliability applies not just to the numbers that result from the measurement. It applies to the way in which a survey question that produces the numbers is worded or phrased.

The wording of the questions must be as unambiguous as possible, or it will trash the survey. I’m taking a survey of whether employees wear jeans to work (yes or no), and what “kind” of jeans, as part of a marketing study. Responses are limited to a set of multiple choices. But the word “kind” can refer to a brand of jeans (Levi’s, Lee’s, etc.), or a style of jeans—skinny, boot cut, relaxed fit, and so on. I want to know what style, but I ask respondents to specify, again, the kind of jeans, and give them a choice between several brands.

My own confusion about what I say I want, and what I will get from the survey, enters the heads of some respondents. Some of them think “kind” means style, as I do, and are puzzled that they are given brand names from which to choose. Others think it means brand, and are content to pick Levi’s or Lee’s or another brand. If we were to poll the respondents on that question, asking whether they thought “kind” meant style or brand, their answers would vary according to what they believed they were being asked rather than what I wanted to know. And I was not clear on it myself. The result? Increased noise in the survey data, and less reliability.

Another example illustrates how to measure data reliability in a different context and with a different tool. But the principles of consistency and repeatability, reliability’s defining characteristics, are the same, with the former implying the latter. In the employ of a large defense contractor a few years ago, I sat in conference at one of our sites in Northern Virginia with 35 engineers of various stripes doing preliminary requirements specifications for a proposed weapons system. We spent a full day brainstorming what those requirements would be.

At the end of the first day, an assistant and I scripted a questionnaire designed to capture data on the consistency and repeatability of the participants’ understandings of the requirements they had identified and named as critical to a successful system. We did not have time or space on a short survey instrument to ask each person to state what was meant by the terminology in Requirement 1, Requirement 2, and so on. But we got to the question of internal consistency by lowering the information requirements of the survey, and going through a logical back door: we asked the respondents to rank their top 20 requirements by importance (1 = highest importance, 20 = lowest).

This isn’t saying we expected each respondent to rank the requirements identically. That happens rarely, if ever, and it wasn’t the point. If there were consistent and repeatable understandings, i.e., reliable understandings, respondents would, more or less, all be ranking the same list. Said another way, everyone responding to the survey would be ranking the same definitions of each requirement. To test for this, I used a statistic known as Cronbach’s Alpha (α), which in effect correlates the ranking of each requirement with every other one and summarizes the average of those correlations. In practice α usually falls between 0 and 1, although it can turn negative when items correlate negatively with one another. It’s a correlation-based measure for all the data collected by the questionnaire or survey. In this instance, if the same lists were being ranked, and thus the same requirement definitions, α would approach 0.5 or better, telling us that about 50 percent or more of the time that was the case.

In our tests, α averaged 0.15, indicating an absence of consistency and repeatability in participants’ terminological understandings of each requirement. To me, the findings indicated strongly that the following kind of situation likely prevailed in the data: for Requirement 1, Respondent 1 had one understanding, Respondent 2 another, and Respondent 3 the same as Respondent 1, but different from Respondent 4, etc. In short, because the respondents had different understandings of each requirement, most of the time each person was ranking a different list. This generated much noise and not much signal, in an effort that demanded every engineer associated with the project be on the same page throughout its execution.
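
For readers who want to see the arithmetic, here is a minimal sketch of Cronbach’s Alpha using the standard variance-based formula, alpha = k/(k-1) * (1 - sum of column variances / variance of row totals). Several things here are assumptions for illustration, not the setup used on the project: the data layout (each row is a requirement, each column is one respondent’s ranking), the function name cronbach_alpha, and the two small made-up tables of ranks.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows. Each row is one rated object (here, a requirement)
    and each column is one "item" (here, one respondent's ranking).
    This layout is an assumption made for the illustration.
    """
    k = len(scores[0])  # number of columns (respondents)
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two rows and two columns")
    columns = list(zip(*scores))
    # Sum of the sample variances of each column.
    item_variances = sum(variance(col) for col in columns)
    # Sample variance of the row totals.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("row totals do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - item_variances / total_variance)


# Hypothetical ranks: 5 requirements (rows) ranked by 3 respondents (columns).
# Respondents who largely agree on what each requirement means:
shared_understanding = [
    [1, 1, 2],
    [2, 2, 1],
    [3, 3, 3],
    [4, 5, 4],
    [5, 4, 5],
]
# Respondents who are, in effect, ranking different lists:
mixed_understanding = [
    [1, 2, 2],
    [2, 5, 1],
    [3, 1, 5],
    [4, 3, 3],
    [5, 4, 4],
]

alpha_shared = cronbach_alpha(shared_understanding)
alpha_mixed = cronbach_alpha(mixed_understanding)
print(f"shared understanding: alpha = {alpha_shared:.2f}")
print(f"mixed understanding:  alpha = {alpha_mixed:.2f}")
```

With these made-up tables the first alpha comes out at about 0.95 and the second at about 0.18. The point of the sketch is only the contrast: when the rankings line up, the row totals spread out and alpha rises; when they don’t, the totals bunch together and alpha collapses.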

The engineers had to define and refine their terms to eliminate the ambiguity in wording or phrasing of requirements terminology. It was crucial that every engineer knew they were all talking about the same thing when they discussed a given specification. This is data reliability in a different context, and a critical element that should occur in any organizational effort that analyzes data: at the beginning.

In case you lost track, that’s my fault. So, one more time: the lower a measure’s reliability coefficient r_xx (I discussed this back in July of this year, in “Data Reliability and Analytic Validity for Non-Dummies”), its Cronbach’s Alpha score, or other diagnostic statistics, the more likely it is to fail reliability tests, and the lower its correlation with any other measure in the dataset will tend to be. Low correlations mean that for any two or more variables (all considered pairwise), fewer data points or observations move positively or negatively in tandem. False positives (Type I Error) or, in this case, false negatives (Type II Error) are the result: degraded consistency and repeatability in the measurement mask real relationships between variables that would otherwise be better correlated.
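
One way to make the masking effect concrete is the classical test theory attenuation formula, under which the expected observed correlation between two measures equals the correlation between their error-free versions multiplied by the square root of the product of their reliabilities. The sketch below is an illustration under that assumption only; the function name attenuated_correlation and the starting value of 0.6 are hypothetical and do not come from any dataset discussed here.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation under classical test theory.

    true_r: correlation between the error-free versions of the two measures.
    reliability_x, reliability_y: reliability coefficients of each measure.
    """
    for rel in (reliability_x, reliability_y):
        if not 0 <= rel <= 1:
            raise ValueError("reliabilities must lie between 0 and 1")
    return true_r * math.sqrt(reliability_x * reliability_y)


true_r = 0.6  # hypothetical correlation between error-free measures
for rel in (1.0, 0.8, 0.5, 0.15):
    # Both measures are given the same reliability to keep the example simple.
    observed = attenuated_correlation(true_r, rel, rel)
    print(f"reliability {rel:.2f} -> expected observed r = {observed:.2f}")
```

As reliability falls, the correlation an analyst can expect to observe shrinks toward zero even though the underlying relationship has not changed, which is how a real effect ends up looking like no effect at all.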

It’s a hot mess, isn’t it? Go ask your CIO and your Data Curators: what do you know about all this? How reliable is our data? And is it valid with respect to the needs of analysts and the various stripes of data scientists in our employ? If you’re one of those who need valid data, of course, you’ll know just what to say.