NORMALIZED COMPRESSION DISTANCE WITH TYPOGRAPHICAL ERRORS: A PRELIMINARY ASSESSMENT

Robert P. Batzinger, James Wolfer

Abstract


Rooted in the principles of Kolmogorov Complexity, the Normalized Compression Distance (NCD) has emerged as a versatile similarity metric in problem domains ranging from music to DNA analysis. Studies in the literature have indicated that the NCD exhibits good performance in the face of large, randomly corrupted documents. This work, as part of a larger effort to explore the viability of the NCD as a similarity metric for analyzing and clustering medical documents, presents a preliminary assessment of NCD performance on smaller, sub-document samples subject to real-world style typographical errors. When applied to an encyclopedia of sewing and The Federalist Papers, the NCD exhibited graceful degradation with increasing typographical error rates and decreasing minimum paragraph size.

Index Terms: Document Analysis, Normalized Compression Distance, Text Analysis, Typographical Error Analysis.
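For readers unfamiliar with the metric, the NCD is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the length of the compressed input. The following is a minimal sketch of that definition, not the authors' implementation. The abstract does not name a compressor or an error model, so the use of zlib and the hypothetical helper `inject_typos` (uniform random letter substitution) are assumptions made purely for illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance under the common definition."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def inject_typos(text: str, rate: float, seed: int = 0) -> str:
    """Hypothetical error model: replace each letter with a random
    lowercase letter with probability `rate`. Real-world typographical
    errors are more structured than this."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            out.append(rng.choice(letters))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = (
        "It has been frequently remarked that it seems to have been "
        "reserved to the people of this country to decide whether "
        "societies of men are really capable or not of establishing "
        "good government from reflection and choice."
    )
    original = sample.encode("utf-8")
    # Print the distance between the clean text and increasingly noisy copies.
    for rate in (0.0, 0.02, 0.05, 0.10):
        noisy = inject_typos(sample, rate, seed=1).encode("utf-8")
        print(rate, round(ncd(original, noisy), 3))
```

With a general-purpose compressor such as zlib, values for short inputs are only approximate, since fixed header overhead makes up a larger share of the compressed size as samples shrink.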




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.