Dissertation Data: Primary or Secondary?
by Dr. Mathieu Despard
Most PhD students use data for their dissertation, so whether to use primary or secondary data is a very important decision. Primary data is data you collect, such as through surveys, scales, or interviews. Secondary data is data that has already been collected by other researchers and that you analyze.
Ultimately, the decision about using primary or secondary data depends on your research questions or hypotheses. Which type of data will be better for answering these questions or testing your hypotheses? Another consideration is whether you plan to use quantitative, qualitative, or mixed methods. Secondary data is usually analyzed with quantitative methods. Qualitative methods can also be used, though there are important challenges to address.
Here’s the basic trade-off: primary data takes more time and effort to collect, but you have more of a chance to align your measures with your research questions or hypotheses. Secondary data has already been collected, yet the data may have too many limitations to be useful.
I’ve explained these pros and cons in more detail below (not an exhaustive list!), followed by a resource list. After you consider pros and cons, it’s important to have a conversation with your dissertation committee chairperson to discuss your strategy in greater detail and consider other pros and cons not discussed here.
Primary Data
Pros
• Customizing and aligning your measures. When you collect primary data, you can decide exactly what you want to measure to answer your research questions or test your hypotheses. For qualitative methods, you can design interview or focus group guides that are unique for your topic and research participants. For quantitative methods, you choose the right set of independent and dependent variables for your research questions or hypotheses. There are plenty of standardized instruments to help measure certain constructs, and using one is easier and less risky than designing your own instrument(s) (see below).
• Social work practice impact. If you collect primary data, there’s a good chance you’ll be working with a research partner – an agency or organization through which you will recruit research participants, conduct data collection, and possibly design and test an intervention. This gives you a better opportunity to use findings to improve practice than with secondary data analysis – especially if you test an agency’s intervention.
• Novel contribution. Because you can customize your measures, you have a better chance of conducting unique and novel research, which may give you a competitive edge on the national academic job market and help define a program of research for which you will seek funding.
• Online sampling and data collection help. To collect primary data, you don’t necessarily need to recruit your own sample and collect your own data. There are many tools for this, including Qualtrics, Amazon Mechanical Turk (MTurk), and Google Surveys.
Cons
• Time. If you collect primary data, chances are you will need a research partner – an agency or organization through which you will recruit research participants, conduct data collection, and possibly design and test an intervention. You will also most likely be conducting human subjects research, which means seeking and securing institutional review board (IRB) approval. All this takes much more time than analyzing secondary data.
• Sampling problems. If you need to recruit your own sample, it may be too small for inferential statistics or may not be representative of the population you want to better understand (a threat to external validity). You may also experience study attrition, which is a problem if you want to collect data in more than one wave (e.g., pretest and posttest). However, sample size is less of a challenge for qualitative studies.
• Intervention problems. If you collect primary data because you want to study an intervention, there are many things that can go wrong: funding for the program dries up, agency staff leave, a change in agency policies and procedures disrupts the intervention, etc.
• Designing new measures that don’t work well. If you choose to design your own measures, this is a “high risk, high reward” proposition. The process can be very time-consuming: extensive literature review, cognitive testing and/or focus groups, pretesting, pilot testing, and testing for reliability and validity. You might make an important scholarly contribution by developing a new measure, but you run the risk of designing an instrument that is not sufficiently reliable or valid.
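To make the reliability testing mentioned above a little more concrete, here is a minimal sketch of one common internal-consistency check, Cronbach's alpha, computed on pilot data. It is only an illustration: the function name, the data layout (one row per respondent, one score per item), and the sample scores are all hypothetical, and a real analysis would normally use a statistics package and examine validity as well.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row per respondent,
    one numeric score per item). Hypothetical helper for illustration."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")

    # Sample variance of each item (column) across respondents.
    items = list(zip(*responses))
    item_variance_sum = sum(variance(item) for item in items)

    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


# Made-up pilot data: 5 respondents answering a 3-item scale (1-5 ratings).
pilot = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 2],
    [1, 2, 1],
]
alpha = cronbach_alpha(pilot)
print(round(alpha, 2))
```

Values closer to 1 suggest the items move together. A threshold of about 0.70 is often cited as a rule of thumb, but what counts as acceptable depends on your field and the purpose of the measure, so it is worth discussing with your committee.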
Secondary Data
Pros
• Time. The data have already been collected, so you can focus on data analysis. If you are working with data that have been fully de-identified, chances are pretty good that your IRB will determine your study to be exempt or not human subjects research.
• Availability. Quite a lot of data have already been collected. In fact, this is an issue of growing concern among some researchers: data that are “just lying around” are going unused. Available sources include research consortia like ICPSR, which has data for nearly 15,000 studies, administrative data from agencies and government sources, and a variety of publicly available data sets. For example, I’ve analyzed data from the National Financial Capability Study and the Survey of Household Economics and Decisionmaking.
• Statistical methods. With a large sample and robust set of variables, you have more opportunities to use multivariate and advanced (e.g., hierarchical linear modeling) statistical methods if this is important for your academic goals.
• More studies post-dissertation. If you find a large dataset that works well for your dissertation and it is data that continues to be collected in the field, you’ll have an opportunity to produce several more studies from the same data. For example, many social work researchers have conducted studies using data from the Fragile Families and Child Wellbeing Study.
Cons
• Data limitations. This is the biggest drawback. Someone else designed the instrument and defined the variables, which may not align well with your research questions or hypotheses. For example, let’s say the main dependent variable I’m going to use for my research questions or hypotheses is social anxiety. I found a secondary data set, but it does not measure social anxiety specifically and/or the measures are flawed (e.g., not established as reliable and valid, not used with certain groups of people, or superseded by a newer and better measure in the field). Thus, the central question is, can you still answer your research questions or test your hypotheses given the limitations of the data?
• Nothing new here. With publicly available secondary data, you must be very careful not to conduct a study that’s already been published. The ICPSR site tells you what studies have already been published with a certain data source, but this isn’t true for all secondary data sources. For large, well-known studies like Fragile Families, it can be very challenging to identify a unique study, which is fundamentally the purpose of your dissertation.
• Qualitative limitations. Using secondary data for qualitative research isn’t very common. It may be hard to find a secondary data source to analyze.
Related sources:
Research Methods Knowledge Base (measurement methods)