Cosine similarity is widely used in data mining, recommendation systems, and information retrieval. Here we will discuss cosine similarity as a proximity measure between two vectors. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitudes. Cosine similarity is particularly useful in positive space, where the outcome is neatly bounded in [0, 1].
A simple example of cosine similarity can be explained using document comparison. In a later post we will discuss how to use the same technique to write a movie recommendation engine. Say we have multiple documents and we need to determine how similar they are; call them document1 and document2. A document can be represented by a bag of terms, i.e. a long vector in which each attribute records the frequency of a particular term (such as a word, keyword, or phrase) in the document. So we will have two term-frequency vectors, d1 and d2: d1 records term frequencies in document1, and d2 in document2. Both vectors are defined over the same set of terms.
Document    team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
---------   ----  -----  ------  --------  ------  -------  -----  ---  ----  ------
document1      5      0       3         0       2        0      0    2     0       0
document2      3      0       2         0       1        1      0    1     0       1
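Using the two term-frequency vectors from the table, cosine similarity can be computed with a minimal sketch in plain Python (no external libraries):

```python
import math

def cosine_similarity(v1, v2):
    # dot product of the two term-frequency vectors
    dot = sum(a * b for a, b in zip(v1, v2))
    # Euclidean norms (magnitudes) of each vector
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# term-frequency vectors from the table above
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]  # document1
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]  # document2

print(round(cosine_similarity(d1, d2), 2))  # → 0.94
```

The value close to 1 tells us the two documents have very similar orientation in term space, which matches intuition: both are about team sports.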
In an earlier post, we discussed passing additional parameters to a MapReduce job. But there are cases in which we have to pass additional files to a MapReduce job. Since MapReduce runs on multiple nodes, we need to ensure that the additional file a mapper/reducer refers to is available on the particular node it is running on. In this post we will discuss how to handle this. Say we need to find the most popular movie in the movie-lens database. If you download the movie-lens data, there are two files we are interested in: u.data and u.item. The format of the files is as shown here.
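Before wiring this into a MapReduce job with a side file, the core logic can be sketched in plain Python. This sketch assumes u.data lines are tab-separated (user id, movie id, rating, timestamp) and that u.item provides a movie-id-to-title lookup; the sample rows below are illustrative, not the real dataset:

```python
from collections import Counter

def most_popular_movie(udata_lines, movie_titles):
    """Return the title of the movie with the most ratings.

    Assumes each u.data line is tab-separated:
    user_id, movie_id, rating, timestamp.
    """
    # count how many ratings each movie id received
    counts = Counter(line.split("\t")[1] for line in udata_lines)
    movie_id, _ = counts.most_common(1)[0]
    # fall back to the raw id if the title lookup is missing
    return movie_titles.get(movie_id, movie_id)

# tiny hypothetical sample in the assumed format
sample_udata = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t242\t1\t878887116",
]
sample_titles = {"242": "Kolya (1996)", "302": "L.A. Confidential (1997)"}
print(most_popular_movie(sample_udata, sample_titles))
```

In the MapReduce version, the rating counts come from the mappers/reducers, while the title lookup file is the extra file every node must be able to read, which is exactly the problem this post addresses.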
In this post I will explain how to add chaining to your MapReduce job, that is, how the output of a reducer is chained as input to another mapper in the same job. As an example, I will improve our regular word count program. The word count program outputs each word along with the number of occurrences of that word in the input book. But if we could sort that output by count, we could easily predict what the book is about. So let's get started.
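The two chained stages can be sketched in plain Python before moving to MapReduce: the first stage is the regular word count, and the second stage reorders that output so the most frequent words come first (the tokenization rule here is an illustrative assumption, not from the original program):

```python
import re
from collections import Counter

def word_count(text):
    # stage 1: the regular word count (word -> number of occurrences)
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def sort_by_count(counts):
    # stage 2: re-key the output so it is ordered by count,
    # descending; ties broken alphabetically
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

text = "to be or not to be that is the question"
print(sort_by_count(word_count(text))[:3])  # → [('be', 2), ('to', 2), ('is', 1)]
```

In the actual job, stage 1 is the familiar word count mapper/reducer, and stage 2 is the extra mapper that receives the reducer's output within the same job.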
Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad hoc data analysis, and iterative processing can be easily achieved. Pig is an abstraction over MapReduce: all Pig scripts are internally converted into Map and Reduce tasks to get the work done. Pig was built to make programming MapReduce applications easier. Before Pig, Java was the only way to process the data stored on HDFS. Pig was first built at Yahoo! and later became a top-level Apache project.