When writing an O’Neill Honors Thesis, you will most likely find yourself dealing with subsets of data, or subsets of subsets of data. At the beginning of my data analysis journey, I received some advice from my thesis advisor about subsets. Did I follow this advice? No. Not only did I brush aside perfectly good advice about a process I had zero experience with, I also paid for it dearly during the last few weeks of V499. So, to keep others from making the same mistakes I did, I am going to relay my advisor’s advice to you and hope that you are a better listener than I was. If you choose to ignore advice from an advisor with years of research experience and “Dr.” in front of their name, I hope that you will at least listen to me, your fellow O’Neill Honors Student.
Index. Index. Index index index. When working with large, archived datasets, you will likely run into empty cells where data were either not collected or not recorded. If this happens to you like it happened to me, indexing will be your friend. For every set or subset of data, make sure you know which variables are complete and which have gaps. Record this information in a spreadsheet that you can look back on throughout your writing process. If this sounds abstract, here is an example from my own research that shows just how important it is.
My research included air emissions data for counties in Indiana. However, not every county has recorded data on air emissions. What I should have done before analyzing the data was make a list of every air emissions variable and the counties that had data available for it. In short, make an index. However, as stated before, I declined to do so. When I copied my data from Excel into a new data analysis program, I did not check how many rows of data I should have had. If I had indexed before analysis, I would have noticed that I had fewer rows in the new program than I needed.
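For readers who would rather build the index in code than in a spreadsheet, the idea above can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions: the variable names (`pm25`, `ozone`), every value in `records`, and the helper functions `build_index` and `check_row_count` are hypothetical and do not come from the thesis data. It shows the two habits described above: listing which counties have data for each variable, and confirming the row count after moving data into another program.

```python
# Hypothetical records: the values and variable names are invented for
# illustration only. None marks a cell with no recorded data.
records = [
    {"county": "Monroe", "pm25": 8.1, "ozone": 0.041},
    {"county": "Marion", "pm25": 10.4, "ozone": None},
    {"county": "Lake", "pm25": None, "ozone": 0.046},
    {"county": "Brown", "pm25": None, "ozone": None},
]


def build_index(rows, key="county"):
    """Map each variable to the list of counties that have a value for it."""
    index = {}
    for row in rows:
        for variable, value in row.items():
            if variable == key:
                continue
            # Register the variable even if no county has data for it.
            index.setdefault(variable, [])
            if value is not None:
                index[variable].append(row[key])
    return index


def check_row_count(index, variable, rows_loaded):
    """Compare the rows loaded elsewhere against what the index expects."""
    expected = len(index[variable])
    if rows_loaded != expected:
        raise ValueError(
            f"{variable}: expected {expected} rows, but {rows_loaded} were loaded"
        )
    return expected


index = build_index(records)
for variable, counties in index.items():
    print(f"{variable}: {len(counties)} counties with data -> {counties}")
```

After pasting a variable into the new program, calling something like `check_row_count(index, "pm25", rows_loaded)` with the row count that program reports would raise an error right away if any rows went missing, instead of letting the gap surface weeks later.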
As a result of this mistake, I accidentally left out data points when generating p-values, so the regression plots I made in Excel did not match the p-values from the other software. By the time I finally realized what had happened, it was almost too late. In the end, everything worked out, and I was able to check and double-check my numbers before it was time to submit my final manuscript. However, I spent hours and hours correcting these errors in the final countdown to the submission date, and I was left with an uneasy feeling that I was still missing something important in my data. Do not be like me. Listen to your advisor, especially if you don’t know as much about collecting and analyzing data as they do, and I guarantee you don’t.