Spark Program to Find Anagrams within a File

Balasubramaniyan Sellamuthu
1 min readJan 29, 2020

Step 1: If you are creating a standalone application, initialize the Spark context through a SparkSession. If you are using spark-shell, a SparkSession (and its SparkContext, sc) is created for you automatically.
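For the standalone case, the setup can be sketched roughly as below; the application name and master URL are placeholder values, not part of the original article.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of standalone setup (assumed values, adjust for your cluster).
val spark = SparkSession.builder()
  .appName("AnagramFinder")   // hypothetical application name
  .master("local[*]")         // run locally on all available cores
  .getOrCreate()

val sc = spark.sparkContext   // the SparkContext used in the steps below
```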

Step 2: Using the SparkContext (sc), read the file (here I have used the sherlock-holmes.txt file) and create an RDD. Caching the RDD gives better performance when it is reused.

val holmes = sc.textFile("../Resources/sherlock-holmes.txt")

Step 3: Transform the above RDD with map and filter to create another RDD of tuples, where the key is the word with its letters sorted alphabetically and the value is a single-element List containing the word itself. For example, if the word is Draft, the key would be adfrt and the value would be List(draft), i.e. (k, v) => (adfrt, List(draft)).

val vocabulary = holmes
  .map(_.replaceAll("[^A-Za-z]+", " "))  // keep letters only
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)                    // drop empty tokens left by the split
  .map(_.toLowerCase)                    // so Draft and draft share one key
  .distinct
  .map(w => (w.sorted, List(w)))         // key = letters sorted alphabetically
  .cache

Step 4: Reduce the above tuples with the reduceByKey transformation, concatenating the Lists that share a key. Then drop the key (we no longer need the alphabetically sorted string) and build pairs of (number of words in the group, List of anagrams). To keep, say, the 100 groups with the most anagrams, sort with sortBy and take the first 100.

vocabulary
  .reduceByKey((x, y) => x ::: y)
  .map(f => (f._2.length, f._2))
  .sortBy(_._1, false)
  .take(100)
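The same pipeline can be checked without a Spark cluster using plain Scala collections, since the core idea is just grouping words by their sorted letters. This is a sketch with a made-up sample sentence, not part of the original program:

```scala
// Plain-Scala sketch of the anagram pipeline above; the sample text
// and the helper name topAnagramGroups are assumptions for illustration.
object AnagramSketch {
  def topAnagramGroups(text: String, n: Int): List[(Int, List[String])] =
    text.replaceAll("[^A-Za-z]+", " ")   // keep letters only
      .toLowerCase
      .split(" ")
      .filter(_.nonEmpty)                // drop empty tokens
      .distinct
      .groupBy(_.sorted)                 // key = letters sorted alphabetically
      .values
      .map(ws => (ws.length, ws.toList.sorted))
      .toList
      .sortBy(-_._1)                     // largest anagram groups first
      .take(n)

  def main(args: Array[String]): Unit = {
    val sample = "Listen, silent night: a thing to draft."
    // listen/silent and night/thing each form a 2-word anagram group
    println(topAnagramGroups(sample, 2))
  }
}
```

Each output pair mirrors Step 4: the group size first, then the list of words that share the same sorted-letter key.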
