For those interested in understanding fatal shootings of civilians by police, a natural place to start is data on these events. Such data is, however, surprisingly hard to access. There are no official sources, but several crowd-sourced data sets are available online. These data sets each contain some common and some unique fields, and they are stored in very different formats; consequently, they require preprocessing before they can be used for practical purposes such as analysis and visualization. In this research, we focus on cleaning such heterogeneous data sets using statistical methods, with a major focus on data matching. Data matching refers to identifying records across different sources that correspond to the same entity. It is an essential step in data cleaning that helps resolve conflicting information across data sets arising from clerical errors, missing data, and differently encoded values. Traditional statistical procedures and classification algorithms are based on supervised learning, which requires training data. However, labeled training data is often unavailable, and it may not be cost effective to prepare it from scratch. We explore a two-step unsupervised learning algorithm that can achieve performance comparable to its supervised counterparts. In the first step, it prepares a seed data set containing record pairs that are relatively extreme matches and non-matches based on discriminating fields. This seed set can then serve as training data for classical classification algorithms (k-nearest neighbors, support vector machines, random forests), which in turn classify the remaining record pairs. In the process, we examine missing data, geocode matching, and feature selection methods that increase classification accuracy by selecting the most discriminating fields. Finally, we examine the prospect of generalizing this approach to differently aligned data.
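The two-step procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the similarity scores are simulated, and the extremity thresholds (0.8 and 0.2), field count, and choice of random forest as the downstream classifier are all assumptions made for the example.

```python
# Hedged sketch of the two-step unsupervised matching approach.
# All scores, thresholds, and parameters here are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Simulate per-pair similarity scores on two discriminating fields
# (e.g., name similarity and date proximity), each scaled to [0, 1].
scores = rng.random((200, 2))

# Step 1: build a seed set from the extremes. Pairs that score high on
# every field are treated as matches; pairs that score low on every
# field are treated as non-matches. The cutoffs are assumptions.
likely_match = scores.min(axis=1) > 0.8
likely_nonmatch = scores.max(axis=1) < 0.2
seed_mask = likely_match | likely_nonmatch
X_seed = scores[seed_mask]
y_seed = likely_match[seed_mask].astype(int)  # 1 = match, 0 = non-match

# Step 2: train a classical classifier on the seed set, then use it to
# label the ambiguous record pairs in the middle of the score range.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_seed, y_seed)
labels = clf.predict(scores[~seed_mask])
print(f"seed pairs: {seed_mask.sum()}, classified pairs: {labels.size}")
```

In practice the simulated `scores` array would be replaced by field-level similarity measures computed from the candidate record pairs of the actual data sets.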