One of the most surprising points Cohen makes in his essay, “From Babel to Knowledge: Data Mining Large Digital Collections,” is summarized in this quote:
“As the size of a collection grows, you can begin to extract information and knowledge from it in ways that are impossible with small collections, even if the quality of individual documents in that giant corpus is relatively poor.”
From the moment we start school and are asked to research a topic, there is one phrase teachers constantly repeat: “quality over quantity.” As students and professionals, we are always asked to evaluate our sources before using them and not to collect a bunch of sources that may not have “good” information. It comes as a shock, then, when Cohen asserts that in the context of data mining, a high volume of “poor” quality sources is actually more valuable than a small volume of very good sources. With this in mind, we can say that having one hundred mediocre sources that claim Abraham Lincoln was born in 1809 would be worth more than having twenty scholarly sources that say the same thing. Although it seems backwards, in a way it makes sense, because relying on a few sources, even ones written by experts, has its disadvantages. No matter how well studied someone is in a subject, they are bound to make mistakes, hold biases, and make assumptions, so it can be helpful to check their claims against a variety of other sources. I have to say, though, that putting quantity over quality in research would probably be most helpful for finding facts such as dates and locations, and less helpful for finding deeper analyses and interpretations of those facts.
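The Lincoln example can be sketched in a few lines of Python. This is only a rough illustration of the idea, not Cohen's actual method: the list of one hundred "mediocre" sources and the `consensus` function are made up for this post, and the error rates are invented.

```python
from collections import Counter


def consensus(claims):
    """Return the most common claim and the share of sources that agree with it."""
    counts = Counter(claims)
    value, votes = counts.most_common(1)[0]
    return value, votes / len(claims)


# Hypothetical data: 100 mediocre sources, 12 of which get the year wrong
# in different ways (typos, off-by-one errors, and so on).
mediocre_sources = [1809] * 88 + [1808] * 5 + [1810] * 4 + [1890] * 3

year, share = consensus(mediocre_sources)
print(year, share)  # prints: 1809 0.88
```

Even though twelve of the invented sources are wrong, their mistakes are scattered across different years, so the correct date still stands out clearly. This kind of counting works for a simple fact like a birth year; it would not help with interpretation or analysis.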
Another interesting aspect of Cohen’s argument for quantity over quality relates to digitizing the texts themselves. In their book “Digital History,” Cohen and Rosenzweig describe the various methods for digitizing analog materials. Each method has benefits and shortcomings: the benefits are usually easier searching or convenience, and the shortcomings are usually limited manipulability or expense. Overall, it seems that spending more time and money on digitizing a text would be better in the long run, because the text would be easier to search and manipulate. In this essay, however, Cohen asserts that it would be better to use low-quality methods to digitize a greater number of texts than to use high-quality methods on a smaller number of texts. He supports this claim by pointing out that, with a greater number of texts to work with, data mining can find more patterns and repetitions that are useful for research. As a result, rules and conventions that we believe to be indisputable are challenged by new technologies and their application to the academic world.