wikiwebwalker

# date_posted == "2024-04-23"
# last_updated == "2024-05-11"
This blog post (and project) is a work in progress! Check back later for more updates. I'm currently working on this blog post, tidying up the codebase, and putting together a final (and more formal) write-up!

RESEARCH QUESTION: Can semantic similarity and contextual relatedness analysis be used to more quickly determine paths between articles in an interlinked database (such as Wikipedia)?

This is a project I did as part of COMPSCI 485: Applications of Natural Language Processing (NLP) taught by Brendan O'Connor in the Spring 2024 semester. I worked on this project alongside two of my friends: Nate Courchesne and Jacob Grassi.

Background and other topic contexts

The main goal of this class's final project is to apply some natural language processing (NLP) system to a task. Jacob came up with the idea to build an NLP model that can beat The Wiki Game. The Wiki Game gives you an entry point Wikipedia article and an exit point Wikipedia article; your goal is to start at the entry point, traverse the links within it, and reach the exit point in as few steps/hops from article to article as possible.

The typical game meta is to find links within the entry point that contextually make sense with the exit point (e.g. picking a "Units of measurement" inter-node when the exit node is "Microscope", since both deal with "micro-"). Another strategy is to look for inter-nodes that are near synonyms of the exit point article's name. With this project, we aimed to simulate that meta using semantic similarity. Semantic similarity measures how alike two pieces of text are in meaning, typically estimated from how frequently words co-occur, with (near) synonyms scoring the highest. It's used in many areas such as topic and sentiment analysis, machine translation, and search optimization. The core "research" question we wanted to answer (stated at the beginning of this post) is whether semantic similarity and contextual relatedness analysis can make the search for paths between nodes in an interlinked database more efficient.

Because of how players typically traverse articles while playing The Wiki Game, we initially wanted a search algorithm that looks at every node on a layer before traversing deeper into the network. Breadth-first search does exactly that (it visits all connected nodes of a graph level by level), so it was our initial choice.
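
As a reference point, here's a minimal sketch of that level-by-level behavior. get_links(title) is a hypothetical helper standing in for "fetch the links on an article" -- it's an illustration, not our actual code:


from collections import deque

# minimal BFS sketch: explores an entire layer of links before going deeper.
# get_links(title) is a hypothetical helper that returns the titles linked
# from a given article.
def bfs(entry_title, exit_title, get_links):
    queue = deque([[entry_title]])    # queue of paths, not just nodes
    visited = {entry_title}
    while queue:
        path = queue.popleft()
        if path[-1] == exit_title:
            return path               # first path found is a shortest one
        for link in get_links(path[-1]):
            if link not in visited:
                visited.add(link)
                queue.append(path + [link])
    return None                       # no path between the two articles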

Overall, this is a network science/graph search and semantic similarity problem. We want a search algorithm that steps through each article, gathers all of its links, compares them via some semantic similarity process to the exit article's name, and uses that "scoring" as a heuristic to guide the search deeper. If the scores in the current layer of inter-nodes drift further from the exit article than the previous layer's did, we backtrack to the previous article.

The scoring heuristic and backtracking parts of our problem make our search algorithm more similar to A* search, which combines knowledge of the path taken so far with scores for neighboring nodes to find the shortest path. Our proposed solution can be summed up as:

A* search on Wikipedia article nodes, with semantic similarity scores (compared to the exit article) as the heuristic function.
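
As a preview, here's a minimal sketch of that proposal. similarity(title) is a hypothetical stand-in for the DistilBERT scoring described later in this post, and get_links(title) again stands in for link extraction. Strictly speaking, this sketch is a greedy best-first search; folding the path length so far into the priority would make it proper A*:


import heapq

# greedy best-first sketch of the proposed search: always expand the article
# whose semantic similarity to the exit article is highest. similarity(title)
# and get_links(title) are hypothetical helpers standing in for the DistilBERT
# scoring and link extraction described in the rest of this post.
def wiki_search(entry_title, exit_title, get_links, similarity):
    # heapq is a min-heap, so negate scores to pop the highest similarity first
    frontier = [(-similarity(entry_title), [entry_title])]
    visited = {entry_title}
    while frontier:
        neg_score, path = heapq.heappop(frontier)
        if path[-1] == exit_title:
            return path
        for link in get_links(path[-1]):
            if link not in visited:
                visited.add(link)
                heapq.heappush(frontier, (-similarity(link), path + [link]))
    return None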

Attempted murder on my laptop (downloading 17,000,000+ Wikipedia articles)

To get all of the Wikipedia articles, our first idea was to build a web crawler using a webdriver like Selenium. This approach would also sidestep the article storage issue, since we'd fetch pages on demand. We opted not to take this approach, however, because we were worried about Selenium firing off too many requests and overloading Wikipedia's servers (they explicitly say NOT to scrape the site to obtain large numbers of articles):

Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia. (Source)

The next attempt was a pip library called Wikipedia (aptly named) that wraps the MediaWiki API. According to the Phabricator board, this library does not seem to be very good at fetching articles given some URL. Someone tried to make a request for this to be fixed, but it was filed as a feature request instead of an actual issue :(.

Ultimately, we opted to download static dump files from the Wikimedia Downloads page. A single dump is a lot of files (23 GB compressed...). I downloaded one of the big current XML dump files, kept it on my local system, and extracted it (very slowly) onto my own device (probably not the smartest idea, but I dealt with the consequences of that later). After dealing with the ramifications of my laptop slowing to an (almost) halt, I moved the files over to an external SSD... which I probably should have done in the first place. If I were smarter and/or had more time, I would have offloaded all of it onto a server I had back home and given ssh credentials to my friends so we could all access it -- but we were in a pinch. It's okay though, because now I have random Wikipedia articles on an SSD for easy access when I don't have an internet connection. Now I can read about Phonemes or Pope John Paul II if I'm on an airplane or something.

After downloading the initial Wikipedia XML dump, I used a tool called Wikiextractor to preprocess and extract all of the articles very nicely from the zipped XML dump. This made my life really easy: each file the tool spit out contained several Wikipedia articles, so all we had to do was write a regex parser (thanks Nate) to separate each article out of the larger files. We could then feed these articles and their titles into our graph search and semantic similarity algorithms.


    < doc id="1687791" url="https://en.wikipedia.org/wiki?curid=1687791" title="Lord God Bird">
        Lord God Bird
        
        Lord God bird may refer to one of two similar-looking large woodpeckers of North America:
        Lord God Bird may also refer to:
        
    < /doc>
Example of one of the articles we can pull from the aggregated/combined large files extracted via Wikiextractor. What could Lord God Bird also refer to...? The world will never know, since Wikiextractor didn't finish this one.... Other articles seemed to parse fine, but this article's bullet points don't seem to have parsed correctly. I guess Lord God Bird is too powerful for Wikiextractor.🐦
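
For the curious, here's a rough sketch of what that separation step can look like (a reconstruction based on the <doc> format shown above, not Nate's exact parser):


import re

# split a Wikiextractor output file into (title, text) pairs by matching
# the <doc ...>...</doc> wrappers shown above
DOC_PATTERN = re.compile(
    r'<doc id="[^"]+" url="[^"]+" title="(?P<title>[^"]+)">(?P<text>.*?)</doc>',
    re.DOTALL,
)

def parse_articles(path):
    with open(path, 'r', encoding='utf-8') as f:
        contents = f.read()
    return [(m.group('title'), m.group('text').strip())
            for m in DOC_PATTERN.finditer(contents)]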

Semantic similarity shenanigans

For our semantic similarity scoring function, we need some way to embed words as vectors. Word2vec is one example: a group of related models that learn word embeddings from groups of words by capturing their semantic and syntactic qualities. It would be a good fit if we were using a Bag-Of-Words (BOW) representation of our Wikipedia articles, but since we want contextual understanding within each article (and don't want to pick up potentially wrong connotations for certain articles), we want something else.

Since we want contextual understanding as well as semantic similarity, we want a bidirectional representation for each word when we calculate its embedding (i.e. getting the context on the left and right in all layers). Thus, we decided to use the Bidirectional Encoder Representations from Transformers (BERT) language representation model (okay, we actually used DistilBERT because it's way smaller and faster with only about a 3% degradation from BERT, so it's good enough for my intents and purposes!).

DistilBERT, used in conjunction with the SentenceTransformer Python library, gives us a model that can compute embeddings for the documents (in this case, Wikipedia articles) we feed it. This will power our similarity checking/scoring pipeline later. I used the multi-qa-distilbert-cos-v1 pretrained model from this list of pretrained models because it had good speed and accuracy according to their benchmarks. It was also tuned on criteria we're looking for (comparing a list of articles to a target article, which is similar to the semantic search task they say this model is used for when answering some query/question).


import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import random

# load the multi-qa DistilBERT model, which will compute our embeddings later
model = SentenceTransformer('multi-qa-distilbert-cos-v1')
        

To take the similarity of a bunch of articles on a single layer (a layer being an article plus every link that shows up on it), I simulated a single layer by sampling a random subset of links from the dataset I downloaded earlier. I also created a fake "exit" article (just a string; I didn't use a real article for it yet) to compare all these articles against later.


#### CREATING SAMPLE DOCUMENTS ####

# document1: array of all links/articles on current layer
document1 = []
with open('test_files/test.txt', 'r') as f:
    for line in f:
        document1.append(line.strip())   # strip the trailing newline from each link

random.shuffle(document1)
document1 = document1[:len(document1)]   # keep 100% of the layer; use len(document1)//4 for 25%, etc.
print(f"Amount of articles in this search layer: {len(document1)}")

# document2: simulated exit point article, will be replaced with an actual article later
document2 = "I am a student at University of Massachusetts Amherst"
        

To calculate similarity scores, I used something called cosine similarity, which takes the normalized dot product of vectors V and W. This is measured by the following formula: cos(V, W) = (V · W) / (||V|| * ||W||). This is the basis of all of the similarity checking we will be doing.
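
As a quick sanity check of the formula (with made-up toy vectors), computing it by hand matches what sklearn's cosine_similarity returns:


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy vectors just to verify the formula against sklearn
V = np.array([1.0, 2.0, 3.0])
W = np.array([2.0, 4.0, 5.0])

by_hand = np.dot(V, W) / (np.linalg.norm(V) * np.linalg.norm(W))
by_sklearn = cosine_similarity([V], [W])[0][0]
print(by_hand, by_sklearn)   # both ~0.996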


#### COSINE SIMILARITY -- CALCULATING SCORE BETWEEN EMBEDDINGS ####

# exit_embedding: the embedding of the exit point article
# link_embeddings: the embeddings of all articles in current layer
# returns indices of link_embeddings articles in order of most similar to least to exit_embedding

def find_similar(exit_embedding, link_embeddings):
    # shape (1, n): one similarity score per article in the layer
    similarity_matrix = cosine_similarity(exit_embedding, link_embeddings)
    print(f"Cosine Similarity Matrix: {similarity_matrix}")
    # argsort is ascending, so reverse it to put the most similar index first
    return similarity_matrix[0].argsort()[::-1]

Now we can write the actual code that uses this cosine similarity function. To call find_similar(), we need to pass in the embedding of the exit point article and the embeddings of the articles in the current layer. We can get those with the following:


print("Encoding document1 (BFS'ed articles)...")
link_embeddings = model.encode(document1)     # will be replaced with BFS Search links later

print("Encoding document2 (exit)...")
exit_embedding = model.encode([document2])  # careful when scaling up to full doc

Now we can just run the function on these embeddings and write the most similar article to a file to save it! We will edit this code later so that the score of the most similar article (the semantic similarity score) gets passed as the heuristic to our graph search. We want this array of most-similar indices to persist, though, in case we need to do some backtracking.


### GETTING SIMILARITY SCORES ###
sim_indices = find_similar(exit_embedding, link_embeddings)   # persistent list of indices, most to least similar

# the MOST similar article
mostsim_index = sim_indices[0]
mostsim_article = document1[mostsim_index]
mostsim_score = cosine_similarity(exit_embedding, link_embeddings)[0][mostsim_index]
print(f"Similarity score of most similar article: {mostsim_score}")
print(f"Most similar article to {document2}: in mostsim.txt")
with open("mostsim.txt", "w") as f:   # "w" mode overwrites mostsim.txt on each run
    f.write(mostsim_article)

And that's how we did the semantic similarity scoring for the embeddings of the articles on the current layer vs. the exit article's embedding! We just need to integrate this part of the code to act as the heuristic for the next section, the A* graph search.

If you were a graph search algorithm, I think you'd be A*!

Frankenstein-ing it all together

Reflection