TF-IDF Implementation in Java

Question

I have tried to follow the formulas for term frequency–inverse document frequency (TF-IDF) and cosine similarity, and translated them into code. The results look right, but I am worried that I have missed something, or that the TF-IDF or cosine calculations are not correct. If someone could review it and give some pointers or critique, it would be a huge help.

This is my class for calculations

package dk.processfactory.dmrpraktik.util;
import java.util.*;

public class TFIDF<T> {
    private List<String> documents = new ArrayList<>();
    private List<String> vocabularyList = new ArrayList<>();
    private Map<Integer,double[]> tfIdfVectors = new HashMap<>();

    /**
     * Makes documents out of a list of objects, by turning them into strings
     * @param objects
     * @return List of objects as strings
     */
    private List<String> splitIntoDocuments(List<T> objects) {
        List<String> documents = new ArrayList<>();
        for (T object : objects) {
            documents.add(object.toString().toLowerCase());
        }
        return documents;
    }

    /**
     * A function to extract all terms from a document
     * @param document
     * @return a list of all terms in that document
     */
    private List<String> terms(String document){
        List<String> terms = new ArrayList<>();
        //given an object string it will replace symbols with space, to seperate terms
        document = document.replaceAll("[,={}\\[\\]]", " ");
        String[] termsplit = document.split("\\s+");

        for (String term : termsplit) {
            terms.add(term);
        }
        return terms;
    }

    /**
     * Calculates TF value, the frequency of a term within a document.
     * TF = number of times term occurs in document/total number of terms in document.
     * @param term
     * @param terms
     * @return TF value for a term in a document
     */
    private double tf(String term, List<String> terms){
        return (double) occurenceOfTerm(term, terms) / totalTerms(terms);
    }

    private int totalTerms(List<String> terms){
        return terms.size();
    }

    private int occurenceOfTerm(String term,List<String> terms){
        return Collections.frequency(terms, term);
    }

    /**
     * Calculates IDF value, how common a word is in the corpus.
     * IDF = log(total number of documents / documents that contain the term)
     * @param term
     * @param documents
     * @return
     */
    private double idf(String term, List<String> documents){
        int n = documents.size();
        int df = 0;
        for (String document : documents) {
            if (terms(document).contains(term)) {
                df++;
            }
        }
        if(df==0){
            return 0;
        }
        return Math.log(Double.valueOf(n)/Double.valueOf(df));
    }

    /**
     * Uses the function tf and idf to calculate tfidf values for all terms in the documents and creating vectors for each document
     * Where each tfidf value is a dimension in the documents vector
     * @param objects
     * @return A Map where key is the document/objects index and value is its vector
     */
    public Map<Integer, double[]> createTFIDFVectors(List<T> objects){
        documents = splitIntoDocuments(objects);

        Set<String> vocabulary = new HashSet<>();
        for (String document : documents) {
            vocabulary.addAll(terms(document));
        }

        vocabularyList = new ArrayList<>(vocabulary);

        for(int docIndex = 0; docIndex < documents.size(); docIndex++){
            String document = documents.get(docIndex);
            List<String> terms = terms(document);

            double[] tfIdfVector = new double[vocabularyList.size()];

            for(int termIndex = 0; termIndex < vocabularyList.size(); termIndex++){
                String term = vocabularyList.get(termIndex);
                double tf = tf(term, terms);
                double idf = idf(term, documents);
                tfIdfVector[termIndex] = tf * idf;
            }
            tfIdfVectors.put(docIndex, tfIdfVector);
        }
        return tfIdfVectors;
    }

    /**
     * A function to calculate cosine similarity between two vectors, it is used to find the most similar documents
     * A cosine value of 0 means no similarity and 1 means identical. The value can be everything between 0 and 1
     * @param vectorA
     * @param vectorB
     * @return Cosine similarity
     */
    public double cosineSimilarity(double[] vectorA, double[] vectorB){
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;

        for(int i = 0; i < vectorA.length; i++){
            dotProduct += vectorA[i] * vectorB[i];
            normA += Math.pow(vectorA[i], 2);
            normB += Math.pow(vectorB[i], 2);
        }

        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Used to create a vector with the corbus from a document.
     * @param document
     * @return A vector representing the document
     */
    private double[] createVectorFromCorbus(String document){
        List<String> terms = terms(document);
        double[] tfIdfVector = new double[vocabularyList.size()];

        for(int termIndex = 0; termIndex < vocabularyList.size(); termIndex++){
            String term = vocabularyList.get(termIndex);
            double tf = tf(term, terms);
            double idf = idf(term, documents);
            tfIdfVector[termIndex] = tf * idf;
        }
        return tfIdfVector;
    }

    /**
     * Used to list the most similar documents to the given object in the corpus
     * @param object the given object that represents a document
     * @return a list with indexes of documents, where most similar is first.
     */
    public List<Integer> listMostSimilarDescend(String object){
        double[] objectVector = createVectorFromCorbus(object);
        Map<Integer,Double> cosineSimilarities = new HashMap<>();

        tfIdfVectors.forEach((document, vector) -> {
            double cosineSimilarity = cosineSimilarity(objectVector, vector);
            cosineSimilarities.put(document, cosineSimilarity);
        });

        List<Integer> mostSimilar = cosineSimilarities.entrySet().stream()
                .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
                .map(Map.Entry::getKey)
                .toList();

        return mostSimilar;
    }

}

This is an example of how the class is used, so you can see the purpose:

public static void main(String[] args) {
    SpringApplication.run(TfidfApplication.class, args);
    String doc1 = "The cat sat on the mat";
    String doc2 = "The dog sat on the mat";
    String doc3 = "dog and dogs are animals";
    
    List<String> documents = Arrays.asList(doc1, doc2, doc3);
    TFIDF<String> tfidf = new TFIDF<>();
    
    Map<Integer,double[]> tfIdfVectors = tfidf.createTFIDFVectors(documents);
    
    System.out.println("TF-IDF Vectors:");
    
    for(Map.Entry<Integer,double[]> entry: tfIdfVectors.entrySet()){
        System.out.println("Document "+entry.getKey());
        double[] tfidfvector = entry.getValue();
        for (double v : tfidfvector) {
            System.out.print(v + " ");
        }
        System.out.println();
    }
    
    System.out.println("Cosine between doc0 and doc1: "+tfidf.cosineSimilarity(tfIdfVectors.get(0), tfIdfVectors.get(1)));
    System.out.println("Cosine between doc1 and doc2: "+tfidf.cosineSimilarity(tfIdfVectors.get(1), tfIdfVectors.get(2)));
    System.out.println("Most similar documents: "+tfidf.listMostSimilarDescend("cat sat the mat"));
}

Results of main function:

TF-IDF Vectors:

Document 0
0.06757751801802739 0.06757751801802739 0.06757751801802739 0.0 0.0 0.1831020481113516 0.06757751801802739 0.0 0.0 0.0 0.06757751801802739

Document 1
0.06757751801802739 0.06757751801802739 0.06757751801802739 0.0 0.0 0.0 0.06757751801802739 0.0 0.0 0.06757751801802739 0.06757751801802739

Document 2
0.0 0.0 0.0 0.21972245773362198 0.21972245773362198 0.0 0.0 0.21972245773362198 0.21972245773362198 0.08109302162163289 0.0

Cosine between doc0 and doc1: 0.5810469954347838

Most similar documents: [0, 1, 2]

Relying on toString in Object to String conversion reduces the value of the generic parameter T. Instead provide a Function<T, String> that performs the conversion (it could even default to lambda Object::toString).
– TorbenPutkonen, Commented Dec 1, 2024 at 10:20
You misspelled the words occurrence and corpus in the method names.
– Alexander Ivanchenko, Commented Dec 1, 2024 at 23:01
Welcome to Code Review! It appears you have a registered account (as evidenced by the suggested edit), which can be merged with your unregistered account. You can use the contact SE page and request the accounts be merged.
– Sᴀᴍ Onᴇᴌᴀ ♦, Commented Dec 2, 2024 at 16:55

Alexander Ivanchenko · Accepted Answer · 2024-12-01 19:19:36Z

Design

There's no benefit in making the TFIDF class generic

You're not using the generic type parameter <T> declared on the class level in any meaningful way. Method createTFIDFVectors() expects a List<T> as a parameter, but the type T is lost as soon as the elements of the list are converted to String.

Moreover, if you create a TFIDF instance like this:

TFIDF<String> tfidf = new TFIDF<>();

guess what type of List can be provided as a parameter when you call createTFIDFVectors() on it?

You'll get a compilation error if you try to give anything which is not a List<String>.

If you have a requirement that createTFIDFVectors() should accept a List of arbitrary objects, then you can ditch the class-level generic type parameter and use List<?> (a list of an unknown type) as the parameter type in createTFIDFVectors(), rather than List<T>. This way it will be able to work with a list of any type.

But I have to point out that the idea of treating any Java type as something that can be converted to a document is a bit strange. I would probably go with List<String> or List<? extends CharSequence> instead.
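As a rough illustration of the List<? extends CharSequence> option, here is a minimal sketch. The class name DocumentNormalizer and the method toDocuments are made up for this example and do not appear in the original code, and Locale.ROOT is my own addition to make the lower-casing independent of the default locale.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of the reviewed code): no class-level <T> is needed.
final class DocumentNormalizer {

    // Accepts List<String>, List<StringBuilder>, or any other CharSequence list.
    static List<String> toDocuments(List<? extends CharSequence> texts) {
        List<String> documents = new ArrayList<>(texts.size());
        for (CharSequence text : texts) {
            // Locale.ROOT is an assumption; the original calls toLowerCase() without a locale.
            documents.add(text.toString().toLowerCase(Locale.ROOT));
        }
        return documents;
    }

    public static void main(String[] args) {
        List<StringBuilder> builders =
                List.of(new StringBuilder("The Cat"), new StringBuilder("The DOG"));
        System.out.println(toDocuments(builders)); // prints [the cat, the dog]
    }
}
```

The wildcard keeps the method usable with different text types, while the compiler still rejects lists of objects that are not text at all.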

Class API

Your TFIDF class lacks a well-defined API. It’s not very clear how to interact with it.

Let's have a look at the following public methods it exposes:

createTFIDFVectors(List)
listMostSimilarDescend(String)
cosineSimilarity(double[], double[])

Out of these three, createTFIDFVectors() modifies the internal state: it reassigns the documents and the vocabulary and updates the map of vectors. Yet neither the method name nor its documentation communicates this intent.

The Javadoc only says that it's meant to "calculate tfidf values for all terms in the documents".

And if we examine the createTFIDFVectors() implementation, it turns out that it can leave the object in an inconsistent state.

You're reassigning the field vocabularyList, but only partially updating the map of vectors tfIdfVectors, because existing entries are overwritten by document index and never cleared. If an earlier call was given more documents than the new one, the map will still contain vectors that were built from the previous vocabulary.
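One possible way to avoid such stale state is to build everything inside a factory method and return a fresh, immutable object on every call. The sketch below only shows that idea. The names TfIdfModel, fit, vocabulary and documentCount are hypothetical, and the tokenization is simplified to splitting on whitespace rather than reproducing terms().

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: state is created once in fit() and never reassigned afterwards.
final class TfIdfModel {
    private final List<String> vocabulary;
    private final List<List<String>> tokenizedDocuments;

    private TfIdfModel(List<String> vocabulary, List<List<String>> tokenizedDocuments) {
        this.vocabulary = List.copyOf(vocabulary);
        this.tokenizedDocuments = List.copyOf(tokenizedDocuments);
    }

    // Every call returns a new model, so data from an earlier corpus cannot leak in.
    static TfIdfModel fit(List<String> documents) {
        Set<String> vocabulary = new LinkedHashSet<>();
        List<List<String>> tokenized = new ArrayList<>();
        for (String document : documents) {
            // Simplified tokenization (assumption): lower-case and split on whitespace.
            List<String> terms = List.of(document.toLowerCase().trim().split("\\s+"));
            tokenized.add(terms);
            vocabulary.addAll(terms);
        }
        return new TfIdfModel(new ArrayList<>(vocabulary), tokenized);
    }

    List<String> vocabulary() { return vocabulary; }

    int documentCount() { return tokenizedDocuments.size(); }

    public static void main(String[] args) {
        TfIdfModel large = fit(List.of("The cat sat on the mat", "The dog sat on the mat"));
        TfIdfModel small = fit(List.of("dog and dogs"));
        // The two models are independent of each other.
        System.out.println(large.documentCount() + " " + small.documentCount()); // prints 2 1
    }
}
```

With this shape, the name fit (or a constructor) makes it explicit that the corpus defines the model, and the query methods can then be read-only.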

Computations

IDF

There are several issues related to the idf() implementation and its usage in createTFIDFVectors().

private double idf(String term, List<String> documents){
    int n = documents.size();
    int df = 0;
    for (String document : documents) {
        if (terms(document).contains(term)) {
            df++;
        }
    }
    if(df==0){
        return 0;
    }
    return Math.log(Double.valueOf(n)/Double.valueOf(df));
}

The presented computation of IDF is inefficient because:

Method idf() tokenizes every document again on each invocation by calling terms(document) in a loop. Instead, you should tokenize each document only once and reuse the result.
terms(document).contains(term) is a linear search done on a List. You should consider using a HashSet.
Method idf() is invoked inside the nested for-loop in createTFIDFVectors(), once per vocabulary term for every document. I.e. it needlessly recalculates the IDF of the same term for each document, although the value depends only on the term and the corpus. This should not happen.
One more minor issue: in Double.valueOf(n)/Double.valueOf(df) there is no need to create these Double wrappers; instead you should cast the int to a primitive double.
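The points above could be combined roughly as follows. This is only a sketch under my own assumptions: the class IdfSketch and the method idfByTerm are hypothetical names, and the method expects documents that have already been tokenized once by the caller.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: IDF is computed once per term for the whole corpus.
final class IdfSketch {

    static Map<String, Double> idfByTerm(List<List<String>> tokenizedDocuments) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> terms : tokenizedDocuments) {
            // The HashSet removes duplicates, so a document counts each term at most once.
            Set<String> distinctTerms = new HashSet<>(terms);
            for (String term : distinctTerms) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        int n = tokenizedDocuments.size();
        Map<String, Double> idf = new HashMap<>();
        // Cast to primitive double instead of creating Double wrappers.
        documentFrequency.forEach((term, df) -> idf.put(term, Math.log((double) n / df)));
        return idf;
    }

    public static void main(String[] args) {
        Map<String, Double> idf = idfByTerm(List.of(
                List.of("the", "cat", "sat", "on", "the", "mat"),
                List.of("the", "dog", "sat", "on", "the", "mat"),
                List.of("dog", "and", "dogs", "are", "animals")));
        System.out.println(idf.get("cat")); // ln(3 / 1), since "cat" is in 1 of 3 documents
    }
}
```

Terms that are absent from the corpus are simply missing from the map, so a caller can use getOrDefault(term, 0.0) to mirror the df == 0 branch of the original idf().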

TF

Computation of the term frequency can also be improved. You're calling tf() for each element in the vocabularyList.

private double tf(String term, List<String> terms){
    return (double) occurenceOfTerm(term, terms) / totalTerms(terms);
}

private int occurenceOfTerm(String term,List<String> terms){
    return Collections.frequency(terms, term);
}

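You can calculate the term frequency of every term a document contains in a single pass by generating a map, instead of iterating through the whole document (that's what Collections.frequency() does) for each vocabulary term.

// requires: import static java.util.function.Function.identity;
//           import static java.util.stream.Collectors.*;
Map<String, Double> tfByTermPerDocument = terms.stream()
    .collect(groupingBy(
        identity(),
        // counting() yields a Long, which is divided by the document length
        collectingAndThen(counting(), c -> (double) c / terms.size())
    ));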
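To show how such a map might be used when filling a document vector, here is a small sketch. The class TfVectorSketch and the methods termFrequencies and toVector are hypothetical names, and the idf map is assumed to have been computed once for the whole corpus beforehand.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: one pass over the document for TF, then map lookups per term.
final class TfVectorSketch {

    static Map<String, Double> termFrequencies(List<String> terms) {
        return terms.stream().collect(Collectors.groupingBy(
                Function.identity(),
                Collectors.collectingAndThen(
                        Collectors.counting(), c -> (double) c / terms.size())));
    }

    // idf is assumed to be precomputed; terms missing from either map contribute 0.
    static double[] toVector(List<String> vocabulary,
                             Map<String, Double> tf,
                             Map<String, Double> idf) {
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vector.length; i++) {
            String term = vocabulary.get(i);
            vector[i] = tf.getOrDefault(term, 0.0) * idf.getOrDefault(term, 0.0);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Double> tf = termFrequencies(List.of("the", "cat", "sat", "on", "the", "mat"));
        System.out.println(tf.get("cat")); // 1 occurrence out of 6 terms
    }
}
```

Each vocabulary term now costs two hash lookups instead of a full scan of the document.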
