I had the pleasure of working on a project for the Center for Advanced Defense Studies (C4ADS) over January and February. The project involved taking a dataset they provided and attempting to train a model that could predict whether or not a company had the capability to manufacture precision materials that could be used in the creation of nuclear fuels and weaponry.
One of the first challenges we faced with this project was a lack of the tools needed to handle a dataset as massive as this one. Our team needed to learn new tools such as PySpark for handling the large amount of data, and Amazon Web Services' Elastic MapReduce (AWS EMR) to give us the computing power necessary to analyze it. We were able to get over this hurdle and continue with the project.
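To give a feel for the kind of setup this involved, here is a minimal PySpark sketch of loading a large export for exploration. The file path and application name are hypothetical; on EMR the Spark session is usually provided by the cluster, so this is just an illustration, not the exact code we ran.

```python
from pyspark.sql import SparkSession

# Minimal sketch: start a Spark session (on EMR this typically already exists)
# and load a large CSV export so it can be explored at scale.
spark = SparkSession.builder.appName("c4ads-exploration").getOrCreate()

# Hypothetical path; the real data lived in storage provided for the project.
df = spark.read.csv(
    "s3://example-bucket/company_data.csv",
    header=True,
    inferSchema=True,
)

print(df.count(), "rows,", len(df.columns), "columns")
df.printSchema()
```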
The next challenge we faced was figuring out where the previous team left off. We were not the first ones to work on this project, so we had to read through the previous group's documentation. They did give us a good starting point in terms of the tech stack they used and some potential leads to check out. Unfortunately, they did not give us a good sense of how their work was organized. We had to go through all of their data exploration notebooks to figure out what went in and what came out. This was a huge lesson in how important good documentation is. Because of this, my group spent a lot of time creating materials and visualizations showing how the different files changed as exploration continued. We also spent time writing about our process and why we took the steps that we did.
The final challenge was figuring out how to create a predictive model for this dataset. The dataset was positive-unlabeled: every observation is either labeled as the positive class or left unlabeled, with no confirmed negatives. This makes prediction challenging and can be tackled with many different approaches. On top of being positive-unlabeled, the problem is also one of anomaly detection, meaning the positive cases are rare. One plus side of anomaly detection is that we have a metric well suited to measuring how good our guesses are: precision, which takes the number of correct positive predictions and divides it by the total number of positive predictions, giving the percentage of the companies we flagged that actually have the capability.
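For concreteness, here is a small sketch of computing precision by hand. The counts are made up for illustration only.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of positive predictions that were actually positive."""
    predicted_positives = true_positives + false_positives
    return true_positives / predicted_positives if predicted_positives else 0.0

# Toy example: the model flags 20 companies, 15 of which truly have the capability.
print(precision(true_positives=15, false_positives=5))  # 0.75
```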
To deal with positive-unlabeled data we have a selection of solutions. The first is to treat all the data as unlabeled and perform clustering, which means looking for distinct groups or clusters within the data. The most prominent clustering method is K-means, which works with continuous numeric data. Another approach is K-modes, which works with categorical data. Our data was a mixture of both categorical and numeric columns, which makes it hard to use either method without throwing away a lot of valuable data. There is also a third clustering method called k-prototypes, which can work with both continuous and categorical data. Unfortunately, the implementations I found did not have the maturity and support I was looking for.
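To make the trade-off concrete, here is a rough sketch of the two options on a mixed-type table. The feature names and values are invented, and this is not the code from the project: it simply contrasts K-means on the numeric columns alone with the k-prototypes implementation from the `kmodes` package, which also handles the categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from kmodes.kprototypes import KPrototypes

# Hypothetical mixed-type data standing in for the real dataset.
df = pd.DataFrame({
    "revenue": np.random.rand(100) * 1e6,               # numeric
    "employee_count": np.random.randint(5, 500, 100),   # numeric
    "country": np.random.choice(["US", "DE", "CN"], 100),  # categorical
})

# Option 1: K-means on the numeric columns only; the categorical
# information is simply dropped.
numeric = StandardScaler().fit_transform(df[["revenue", "employee_count"]])
kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(numeric)

# Option 2: k-prototypes, which combines Euclidean distance on numeric
# columns with simple matching dissimilarity on categorical ones.
kproto = KPrototypes(n_clusters=3, init="Cao")
kproto_labels = kproto.fit_predict(df.values, categorical=[2])
```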
Another teammate was working on higher-level modeling, so I decided to go back and perform some data exploration. What I found is that many of the columns had been transliterated. Transliteration is when you convert foreign characters to phonetically match the English alphabet, so everything I could not understand before I still could not understand, but at least I could pronounce the words. I ended up removing roughly 70 columns because they were redundant. Many of the columns were slight variants of another column, so I made a lot of judgment calls about which version of a column looked more intact.
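As a sketch of what this cleanup looked like in PySpark (with hypothetical column names and toy data, not the project's actual code), a near-duplicate pair of columns can be compared by completeness and the weaker one dropped:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("column-cleanup").getOrCreate()

# Hypothetical near-duplicate pair: an original field and its transliterated
# variant, where one copy is noticeably more complete than the other.
df = spark.createDataFrame(
    [("Acme GmbH", "Acme GmbH"), (None, "Severny Zavod"), (None, "Vostok OOO")],
    ["company_name", "company_name_translit"],
)
pair = ("company_name", "company_name_translit")

# Count nulls in each column of the pair and drop whichever is less complete.
null_counts = df.select(
    *[F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in pair]
).first()
drop_col = pair[0] if null_counts[pair[0]] > null_counts[pair[1]] else pair[1]
df = df.drop(drop_col)
```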
By the end of the project, we had a cleaned, much smaller dataset and had started to make some advances on the modeling. We only had two months to work on this, so our time with the project was limited. We hope that we cleared a path for the group that follows us and that they can make progress in the areas where we could not.