Finding Internal Link Opportunities at Scale with Vector Search

Most of the buzz around AI and SEO has centered on some form of text generation. But one of the most interesting features, and one I have yet to see really discussed, is the use of embeddings and vector search.

What are Embeddings?

To understand what vector search is, you first need to know what embeddings are.

Embeddings are essentially a translation of bodies of text (which I’ll call documents) into numbers, which allows algorithms to better understand the content of each document.

These documents could be as short as an H1 or as long as an in-depth article.

What is Vector Search?

Once you have these embeddings (i.e. number representations of your documents), a vector search compares them against each other to measure how similar the documents are.

The higher the similarity of these numbers, the more likely the documents are related.
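
To make the comparison concrete, here is a minimal sketch of cosine similarity, one common way to score how close two embeddings are. It is only an illustration: the vectors below are made-up 3-dimensional stand-ins (real embeddings have far more dimensions), and the function name is my own rather than anything from a library.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value between -1 and 1; higher means more similar.
function cosineSimilarity(a, b) {
    if (a.length !== b.length) {
        throw new Error('Vectors must have the same length');
    }
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up "embeddings" for illustration only
const running = [0.9, 0.1, 0.0];
const jogging = [0.8, 0.2, 0.1];
const baking = [0.0, 0.2, 0.9];

// The two related topics score higher than the unrelated pair
console.log(cosineSimilarity(running, jogging) > cosineSimilarity(running, baking)); // true
```

A vector database does this kind of comparison for you across every stored document, so you never have to loop over pairs yourself.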

If you’d like to dive deeper into the nitty gritty details of how vector search works, you can read more about it on OpenAI’s blog.

Why Use Vector Search for Internal Links?

So why the heck should you use vector search instead of using something like ScreamingFrog + regex?

Well… instead of trying to find cases of whether a keyword is on a page or not, you’re now able to find opportunities based on semantic similarity. In plain English that means you can flag internal links based on topical similarity.

How To Find Internal Linking Opportunities

The following sections provide a step-by-step breakdown of this GitHub repo and how the script works. Please note that the repo is simply a proof of concept and would need to be refined further to be production ready.

1. Exporting & Prepping Your Documents

In my case, WordPress is typically the go-to CMS for the clients that I work with, and the platform thankfully allows you to export all of the pages or posts as an XML document.

Once exported, I parse the XML file into an easy-to-use JSON object, then convert each article’s HTML to plain text and strip the link URLs from the text:

// Get XML file (readFileSync is synchronous, so no await is needed)
let articlesXml = fs.readFileSync(ARTICLE_POSTS, 'utf8');

// Parse XML file to JSON
let articlesJson = await convertXml.xml2js(articlesXml, {compact: true, spaces: 4, ignoreComment: true});

// Convert each article's HTML to plain text, ignoring link URLs
let formattedArticles = articlesJson.rss.channel.item.map((article) => {
    return { ...article,
        articleText: convertHtml(article['content:encoded']._cdata, {linkBrackets: false, ignoreHref: true})
    };
});

2. Translate Your Documents into Embeddings

Once you have the documents ready, you then need to get the embeddings for them (AKA translating them into numbers).

OpenAI provides an easy-to-use Embeddings endpoint: you send it the document and it returns the embedding.

You can see how to do that here:

// Vectorize each article with OpenAI
for (let article = 0; article < formattedArticles.length; article++) {

    // Create embedding via OpenAI
    let embedding = await openai.embeddings.create({
        model: 'text-embedding-ada-002',
        input: formattedArticles[article].articleText,
        encoding_format: 'float'
    });

    // Add the full embedding response to the article object
    // (the vector itself lives at embedding.data[0].embedding)
    formattedArticles[article].embedding = embedding;

}

3. Save Your Embeddings to a Vector Database

Now that you have the embeddings, you can save them into a vector database. In my case, I’m using Pinecone.

Beyond pushing the embeddings to Pinecone, make sure the ID you use can easily be cross-referenced (pro tip: use the unique ID of the document from the CMS as the ID in Pinecone). You may also want to include additional metadata about the document, such as the category or tags from your CMS.

// Chunk the articles
const chunkedArticles = formattedArticles.reduce((chunkedResults, article, index) => {

    // Work out which chunk this article belongs to (50 articles per chunk)
    const chunkIndex = Math.floor(index / 50);
    
    // Start a new chunk
    if(!chunkedResults[chunkIndex]) {
        chunkedResults[chunkIndex] = [];
    }
    
    // Add the article to the chunk
    chunkedResults[chunkIndex].push(article)
    
    return chunkedResults
}, []);

// Target a Pinecone index
const pineconeIndex = pinecone.index(PINECONE_INDEX);

// Send the chunks to Pinecone
for (const chunk of chunkedArticles) {

    // Create an empty embeddings array
    let embeddings = [];

    // Push the embeddings of each article to the embeddings
    for (const article of chunk) {
        embeddings.push({
            id: article['wp:post_id']._text,
            values: article.embedding.data[0].embedding,
            metadata: {
                // Fall back to '' when an article has no single category element
                category: article.category?._cdata?.toLowerCase() ?? ''
            }
        });
    }

    // Push embedding to Pinecone
    await pineconeIndex.upsert(embeddings);

    // Provide confirmation of saving
    console.log(`Pushed ${chunk.length} article embeddings to Pinecone`);
}

// Save data to a JSON file
fs.writeFileSync('./output/article-embeddings.json', JSON.stringify(formattedArticles));
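
As a sketch of the cross-referencing idea, the record-building step can be pulled out into a small function and tried against a mock article. The helper name and the mock data are hypothetical; the object shape simply mirrors the one used in the script above.

```javascript
// Hypothetical helper: build one Pinecone record from an article object
// shaped like the ones produced in the earlier steps.
function toPineconeRecord(article) {
    // A missing category (or one that is not a single CDATA element) falls back to ''
    const category = article.category && typeof article.category._cdata === 'string'
        ? article.category._cdata.toLowerCase()
        : '';
    return {
        id: article['wp:post_id']._text,             // CMS ID doubles as the vector ID
        values: article.embedding.data[0].embedding, // the raw vector from the embedding response
        metadata: { category }
    };
}

// Mock article for illustration (values are made up)
const mockArticle = {
    'wp:post_id': { _text: '123' },
    category: { _cdata: 'Technical SEO' },
    embedding: { data: [{ embedding: [0.1, 0.2, 0.3] }] }
};

console.log(toPineconeRecord(mockArticle));
```

Because the record ID is the WordPress post ID, any match Pinecone returns later can be looked up directly in the exported data.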

4. Compare Your Link Target Embedding with Your Vector Database

This is where the rubber finally hits the road. Take the WordPress post ID of the URL you’re trying to find links to (I’ll call this the target document), which should also be the ID of the document in Pinecone, and ask Pinecone for the documents that are most similar to it. In my case, I request the top 50 similar documents.

Pinecone will then send you a whole slew of results back with a score between 0 and 1, where 0 is irrelevant and 1 is identical.

The list is great, but next we need to filter it down to actual opportunities. I do this by:

Excluding the target document itself
Removing results that are below a certain score threshold (I recommend a minimum of 0.7)
Removing results that are already linking to your target document
Cleaning up the results into something that is human readable

// Get matched opportunities from Pinecone
let opps = await pinecone.index(PINECONE_INDEX).query({ topK: 50, id: TARGET_ARTICLE_ID})

// Get Target Article Info
let targetArticleInfo = formattedArticles.filter(function(target) {
    return target['wp:post_id']._text === TARGET_ARTICLE_ID
})

// Filter
let filteredOpps = opps.matches.filter(function(opp) {
    // Remove target article & articles below the scoreThreshold
    return opp.id !== TARGET_ARTICLE_ID && opp.score >= SCORE_THRESHOLD;
})

// Merge Pinecone Results + WP Data
let finalOpp = filteredOpps
    // Keep only the opps that exist in the WordPress export
    .filter(opp => formattedArticles.some(wp => wp['wp:post_id']._text === opp.id))
    // Add WP link, title and HTML
    .map(opp => {
        // Look up the matching WordPress article once
        const wp = formattedArticles.find(wp => wp['wp:post_id']._text === opp.id);
        return {
            targetUrl: targetArticleInfo[0].link._text,
            ...opp,
            link: wp.link._text,
            category: wp.category ? wp.category._cdata : '',
            title: wp.title._cdata,
            htmlContent: wp['content:encoded']._cdata
        };
    })
    // Remove articles already linking to target
    .filter(opp => !opp.htmlContent.includes(targetArticleInfo[0].link._text))
    // Clean up the opps for CSV output
    .map(opp => {
        delete opp.htmlContent;
        delete opp.values;
        delete opp.sparseValues;
        delete opp.metadata;
        return opp;
    });
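
To see the four filtering steps in isolation, here is a simplified sketch that runs them against mock data instead of a live Pinecone index. The function name, the flat `{ id, url, html }` article shape, and the sample URLs are all assumptions made for illustration; the real script works on the raw WordPress export instead.

```javascript
// Hypothetical, simplified version of the filtering steps.
// `matches` mimics { id, score } results from the vector database;
// `articles` is a flat list of { id, url, html } objects.
function filterOpportunities(matches, articles, targetId, scoreThreshold) {
    const byId = new Map(articles.map(article => [article.id, article]));
    const target = byId.get(targetId);
    if (!target) {
        throw new Error(`Target article ${targetId} not found`);
    }
    return matches
        // 1. Exclude the target document itself
        .filter(match => match.id !== targetId)
        // 2. Drop results below the score threshold
        .filter(match => match.score >= scoreThreshold)
        // Skip any IDs that are missing from the article list
        .filter(match => byId.has(match.id))
        // 3. Drop articles that already link to the target
        .filter(match => !byId.get(match.id).html.includes(target.url))
        // 4. Keep only human-readable fields
        .map(match => ({
            targetUrl: target.url,
            url: byId.get(match.id).url,
            score: match.score
        }));
}

const sampleArticles = [
    { id: '1', url: 'https://example.com/a/', html: '<p>Target page</p>' },
    { id: '2', url: 'https://example.com/b/', html: '<a href="https://example.com/a/">A</a>' },
    { id: '3', url: 'https://example.com/c/', html: '<p>No links yet</p>' },
    { id: '4', url: 'https://example.com/d/', html: '<p>Off topic</p>' }
];
const sampleMatches = [
    { id: '1', score: 1.0 },
    { id: '2', score: 0.91 },
    { id: '3', score: 0.84 },
    { id: '4', score: 0.55 }
];

// Only article 3 remains: it clears the threshold and does not link to the target yet
console.log(filterOpportunities(sampleMatches, sampleArticles, '1', 0.7));
```

Note that the "already linking" check is a plain substring match on the URL, so it can produce false positives when one URL is a prefix of another.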

5. Profit!

Last but most definitely not least, save the link opportunities into a nice and tidy CSV file so you can do a final manual spot check and start building those internal links:

// Save output as CSV
fs.writeFileSync('./output/opps-'+TARGET_ARTICLE_ID+'.csv', await json2csv.json2csv(finalOpp));
// Send success message
console.log(`There were ${finalOpp.length} link opportunities found for the URL ${ targetArticleInfo[0].link._text}`);

Closing Thoughts

In my testing, I found accuracy to be the biggest issue: some internal linking opportunities with high scores were not topically relevant, while some lower-scoring opportunities were more relevant.

A few ideas for improving the accuracy of your results:

Filtering results by metadata
Vectorizing and searching by page title rather than body content
Using hybrid search with weighted sparse and dense vectors
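
As a rough sketch of the first idea, the query could be restricted to documents that share a category with the target. This assumes the category was stored as lowercase metadata (as in step 3); the helper name is hypothetical, while `$eq` is Pinecone’s metadata equality operator.

```javascript
// Hypothetical helper: build the query parameters for a similarity search
// restricted to a single category stored in the vector's metadata.
function buildCategoryQuery(targetId, category, topK = 50) {
    return {
        id: targetId,
        topK,
        // Categories were lowercased before being saved, so lowercase here too
        filter: { category: { $eq: category.toLowerCase() } },
        includeMetadata: true
    };
}

const queryParams = buildCategoryQuery('123', 'Technical SEO');
console.log(JSON.stringify(queryParams, null, 2));

// With a configured client, this would be passed to the same call used in step 4:
// const opps = await pinecone.index(PINECONE_INDEX).query(queryParams);
```

Whether this helps depends on how clean your categories are, so treat it as something to test rather than a guaranteed fix.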

I hope you enjoyed the walkthrough and it got your gears turning on how you can use embeddings and vector search in your day-to-day SEO tasks.