Hi,

I think you sent this to the wrong person.

Kind Regards,
Raine Pretorius

-------- Original message --------
From: Peter Otten <__peter__ at web.de>
Date: 2020/07/02 11:09 (GMT+02:00)
To: python-list at python.org
Subject: Re: trying to improve my knn algorithm

kyrohammy at gmail.com wrote:

> This is another account but I am the op. Why do you mean normalize? Sorry
> I'm new at this.

Take three texts containing the words

covid, vaccine, program, python

Some preparatory imports because I'm using numpy:

>>> from numpy import array
>>> from numpy.linalg import norm

The texts as vectors, the first entry representing "covid" etc.:

>>> text1 = array([1, 1, 0, 0])  # a short text about health
>>> text2 = array([5, 5, 0, 0])  # a longer text about health
>>> text3 = array([0, 0, 1, 1])  # a short text about programming in Python

Using your distance algorithm you get

>>> norm(text1-text2)
5.6568542494923806
>>> norm(text1-text3)
2.0

The two short texts have greater similarity than the texts about the same
topic!

You get a better result if you divide by the total number of words, i.e.
replace absolute word count with relative word frequency

>>> text1/text1.sum()
array([ 0.5,  0.5,  0. ,  0. ])

>>> norm(text1/text1.sum() - text2/text2.sum())
0.0
>>> norm(text1/text1.sum() - text3/text3.sum())
1.0

or normalize the vector length:

>>> norm(text1/norm(text1) - text2/norm(text2))
0.0
>>> norm(text1/norm(text1) - text3/norm(text3))
1.4142135623730949
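
To tie the length normalization back to the knn question, here is a minimal
sketch of how it might slot into a nearest-neighbour lookup. It is not the
OP's code (which is not shown in this message). The function names
unit_length and nearest, the simplification to k = 1, and the treatment of
all-zero rows are assumptions made purely for illustration.

```python
import numpy as np


def unit_length(counts):
    """Scale each row of word counts to Euclidean length 1."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.linalg.norm(counts, axis=1, keepdims=True)
    # Assumption: a row of all zeros (a text containing none of the
    # vocabulary words) is left as zeros instead of dividing by zero.
    lengths[lengths == 0] = 1.0
    return counts / lengths


def nearest(train_counts, train_labels, query_counts):
    """Return the label of the training text closest to the query (k = 1)."""
    train = unit_length(train_counts)
    query = unit_length([query_counts])[0]
    # Euclidean distance between the normalized query and every training row.
    distances = np.linalg.norm(train - query, axis=1)
    return train_labels[int(np.argmin(distances))]


# Columns: covid, vaccine, program, python
train_counts = [[1, 1, 0, 0], [0, 0, 1, 1]]
train_labels = ["health", "programming"]

# The longer health text now matches the short health text.
print(nearest(train_counts, train_labels, [5, 5, 0, 0]))  # health
```

Normalizing both the training rows and the query means only the direction of
each count vector matters, so text length no longer dominates the distance.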

