
Social Web: Where are the Semantics?

Tutorial within the ESWC 2014



Code Snippets: Implicit semantics

Task 3: Working with topic models

We are using the Mallet library for this example. You will need to add mallet.jar and mallet-deps.jar to your application's classpath.

Snippet #3.1: Calculate the model

The first snippet defines a class with a few static attributes: the topic model itself, the alphabet of all the words used, the list of documents, the assignments of topics to documents, and a predefined number of topics.

Next, the calculateModel method takes as input the name of a file in Mallet format: one document per line, each line made up of three tab-separated fields. The first field is the document id, the second is a class label (used only if classification tasks are to be performed later on), and the third is the text itself, which must contain no tab or newline characters.


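import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.CharSequenceLowercase;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceRemoveStopwords;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicAssignment;
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureSequence;
import cc.mallet.types.IDSorter;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.types.LabelSequence;

public class TopicModels {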

    //The topic model
    static ParallelTopicModel model;
    //The alphabet: list of used words
    static Alphabet alphabet;
    //The documents: list of documents
    static InstanceList documents;
    //The assignments: list of assignments of topic to documents
    static List<TopicAssignment> assignments;
    //Number of topics to be used
    static int numTopics = 50;
	
    /**
     * Calculates the LDA model 
     * @param sfile Name of a file containing the data in Mallet format, each line being:[ id \t class \t text \n ]
     */
    public static void calculateModel(String sfile) throws Exception {
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new CharSequenceLowercase());
        pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
        pipeList.add(new TokenSequenceRemoveStopwords(new File("en.txt"), "UTF-8", false, false, false));
        pipeList.add(new TokenSequence2FeatureSequence());                                                    
        documents = new InstanceList(new SerialPipes(pipeList));
        Reader reader = new InputStreamReader(new FileInputStream(new File(sfile)), "UTF-8");
        documents.addThruPipe(new CsvIterator(reader, Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"), 3, 2, 1)); 
        model = new ParallelTopicModel(numTopics, 1.0, 0.01);
        model.setRandomSeed(0);                         //Fixed seed, set before adding the instances
        model.addInstances(documents);
        model.setNumThreads(2);
        model.setNumIterations(100);
        model.estimate();                               //The model is calculated here
        alphabet = documents.getDataAlphabet();
        assignments = model.getData();
    }
}	
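
To see what the two regular expressions in calculateModel do to an input line, the following standalone sketch (plain Java, no Mallet required) applies them by hand. The class name LineFormatDemo, its method names and the sample line are made up for illustration; Mallet's own CsvIterator and pipes do more work than this, so treat it only as a way to experiment with the patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: applies the patterns from calculateModel without Mallet. */
class LineFormatDemo {

    // Same pattern that is passed to CsvIterator: group 1 = id, group 2 = class, group 3 = text
    static final Pattern LINE = Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$");
    // Same pattern that is passed to CharSequence2TokenSequence
    static final Pattern TOKEN = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    /** Splits one input line into {id, class, text}. */
    static String[] splitLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not in 'id class text' format: " + line);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    /** Lowercases the text and collects every match of the token pattern. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] fields = splitLine("doc1\tnews\tTopic models aren't hard, it is LDA!");
        System.out.println("id=" + fields[0] + " class=" + fields[1]); // id=doc1 class=news
        System.out.println(tokenize(fields[2])); // [topic, models, aren't, hard, lda]
    }
}
```

Note that "it" and "is" disappear here because the token pattern only matches sequences of at least three characters; this happens before any stopword list is applied.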

Snippet #3.2: See the info on a particular document

The following method can be added to the class to show information on a particular document: its actual tokens (the words that remain after lowercasing and stopword removal), each followed by the topic assigned to it, and the document's probability for every topic in which it exceeds 0.1.


    /**
     * Shows information on the i-th document
     * @param iDocument Index of the document whose information is to be shown
     */
    public static void showDocument(int iDocument) {
        System.out.println("INFO ON DOCUMENT #" + iDocument + " ========");
        //First we show the words in the document
        TopicAssignment mensaje = assignments.get(iDocument);
        LabelSequence topics = model.getData().get(iDocument).topicSequence;
        FeatureSequence tokens = (FeatureSequence) mensaje.instance.getData();  
        for (int i = 0; i < tokens.getLength(); i++) 
            System.out.print(tokens.getObjectAtPosition(i) + "(" + topics.getIndexAtPosition(i) + ") ");
        System.out.println();
        //Second show the probabilities for each topic
        double[] topicDistribution = model.getTopicProbabilities(iDocument);     
        for (int i = 0; i < numTopics; i++) {
            if (topicDistribution[i] > 0.1) {
                System.out.println(String.format("Topic%d: %.3f ", i, topicDistribution[i]));
            }
        }
    }
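
The two things printed by showDocument are related: the per-topic probabilities can be derived from the per-token topic assignments. The sketch below is a standalone illustration of that idea, assuming a symmetric prior in which alphaSum (1.0 in the constructor call of Snippet #3.1) is spread evenly over the topics. It is not Mallet's code, and the class name DocTopicDemo, the method name and the sample assignments are invented for the example.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper: smoothed topic proportions from per-token topic assignments. */
class DocTopicDemo {

    /**
     * @param tokenTopics topic index assigned to each token of one document
     * @param numTopics   number of topics in the model
     * @param alphaSum    total prior mass, spread evenly over the topics (assumption)
     * @return one probability per topic, summing to 1
     */
    static double[] topicProportions(int[] tokenTopics, int numTopics, double alphaSum) {
        double[] p = new double[numTopics];
        Arrays.fill(p, alphaSum / numTopics);      // start from the prior
        for (int topic : tokenTopics) {
            p[topic] += 1.0;                       // add one count per token
        }
        double total = tokenTopics.length + alphaSum;
        for (int k = 0; k < numTopics; k++) {
            p[k] /= total;                         // normalize
        }
        return p;
    }

    public static void main(String[] args) {
        // Made-up document with five tokens, assigned to topics 2, 2, 0, 2 and 3
        double[] p = topicProportions(new int[] { 2, 2, 0, 2, 3 }, 4, 1.0);
        for (int k = 0; k < p.length; k++) {
            if (p[k] > 0.1) {
                System.out.println(String.format(Locale.ROOT, "Topic%d: %.3f", k, p[k]));
            }
        }
        // Prints:
        // Topic0: 0.208
        // Topic2: 0.542
        // Topic3: 0.208
    }
}
```

Topic 1 has no tokens, so only the prior remains (0.25 / 6, about 0.042) and it falls below the 0.1 threshold.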

Snippet #3.3: See more info on a particular topic

A topic is simply a list of weights, one for each word in the alphabet. This snippet shows the five most relevant words for a topic, each with its weight.


    /**
     * Shows information on a certain topic
     * @param iTopic Index of the topic whose information is to be shown
     */
    public static void showTopic(int iTopic) {
        System.out.println("INFO ON TOPIC #" + iTopic + " ========");
        ArrayList<TreeSet<IDSorter>> topicSortedWords = model.getSortedWords(); //One sorted set of word ID/weight pairs per topic
        Iterator<IDSorter> iterator = topicSortedWords.get(iTopic).iterator();
        int rank = 0;
        while (iterator.hasNext() && rank < 5) {
            IDSorter count = iterator.next();
            System.out.println(String.format("%s (%.0f)", alphabet.lookupObject(count.getID()), count.getWeight()));
            rank++;
        }
    }
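
The weights printed by showTopic are counts rather than probabilities. As a rough illustration of how such counts relate to word probabilities, the standalone sketch below ranks words by count and applies the usual smoothed estimate (count + beta) / (total + V * beta), where V is the vocabulary size and beta is the 0.01 passed to the constructor in Snippet #3.1. The class name TopicWordsDemo, its methods and the counts are hypothetical, and the formula is the textbook one, not a transcription of Mallet's internals.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: top words of a topic and their smoothed probabilities. */
class TopicWordsDemo {

    /** Returns the n entries with the highest counts, highest first. */
    static List<Map.Entry<String, Integer>> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    /** Smoothed probability of a word within a topic. */
    static double wordProbability(int count, int totalCount, int vocabularySize, double beta) {
        return (count + beta) / (totalCount + vocabularySize * beta);
    }

    public static void main(String[] args) {
        // Made-up word counts for a single topic
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("web", 30);
        counts.put("semantic", 40);
        counts.put("user", 1);
        counts.put("social", 20);
        counts.put("data", 9);

        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        for (Map.Entry<String, Integer> e : topWords(counts, 3)) {
            double p = wordProbability(e.getValue(), total, counts.size(), 0.01);
            System.out.println(String.format(Locale.ROOT, "%s (%d) p=%.3f", e.getKey(), e.getValue(), p));
        }
        // Prints:
        // semantic (40) p=0.400
        // web (30) p=0.300
        // social (20) p=0.200
    }
}
```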

Snippet #3.4: See the topics for a new document

Once the topic model has been calculated, it can be used to infer the topic distribution of a new document.


    /**
     * Shows the inferred topics for a new document
     * @param text Text of the new document
     */
    public static void showInferredTopics(String text) {
        System.out.println("INFO ON A NEW TEXT" + " ========");
        InstanceList testing = new InstanceList(documents.getPipe());
        testing.addThruPipe(new Instance(text, null, "new text", null));
        TopicInferencer inferencer = model.getInferencer();
        //Arguments: the instance, 10 sampling iterations, thinning of 1, burn-in of 5
        double[] testProbabilities = inferencer.getSampledDistribution(testing.get(0), 10, 1, 5);
        for (int i = 0; i < numTopics; i++) {
            if (testProbabilities[i] > 0.1) {
                System.out.println(String.format("Topic%d: %.3f ", i, testProbabilities[i]));
            }
        }
    }

License: The contents of this page are licensed under a CC-BY license.

Disclaimer: We provide this code without any warranty. Use it at your own risk.

Photo used under Creative Commons CC-BY license from youasamachine