Big Crawler IPs is part of a much larger initiative to provide SEOs and marketers with open source data warehousing tools.
As I dive deeper into the world of SEO, whether through my own personal projects or with the clients I work with, I'm realizing that having a data warehouse has become more and more important.
One data source that I've found to be a big pain to consistently collect is log files, as they typically require server-level access and a developer.
Thankfully, there is a super handy-dandy tool called LogFlare that sits on top of CloudFlare and takes care of all the heavy lifting of collecting and storing log data in BigQuery, which happens to be my data warehouse of choice.
As amazing as LogFlare is, it's simply a firehose of log data into your warehouse (the loading part of ELT). No filtering or transformation happens along the way; the onus for that is on you (as it should be).
Why Build This
So if you're trying to build out an SEO data warehouse, one of the first filtering steps you should take with log files is "authenticating" the data: making sure it is actually GoogleBot or BingBot crawling your site, rather than a tool such as SiteBulb or ScreamingFrog.
To do this, you have to rely on the IP address rather than the User-Agent, since the User-Agent is easily spoofed. Thankfully, both GoogleBot and BingBot publish a list of their IP addresses.
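As a rough sketch of what that IP check looks like, here's a minimal Python example using the standard library's `ipaddress` module. The sample prefixes below are hypothetical stand-ins in the same shape as the ranges from the published JSON files, not an authoritative list:

```python
import ipaddress

# Hypothetical sample prefixes; in practice you'd load these from the
# official GoogleBot/BingBot JSON feeds rather than hard-coding them.
GOOGLEBOT_PREFIXES = ["66.249.64.0/19", "2001:4860:4801::/48"]

def is_verified_crawler(ip: str, prefixes: list[str]) -> bool:
    """Return True if `ip` falls inside any published crawler CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(p) for p in prefixes)
```

A User-Agent string claiming to be GoogleBot from an IP outside those ranges would fail this check, which is exactly the kind of traffic you want to filter out before it pollutes your warehouse.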
So how do you get these IP addresses into your warehouse?
Well, that is where Big Crawler IPs comes into play.
How It Works
Once the Cloud Function is triggered, it reads the official GoogleBot and BingBot IP address JSON files and cross-references them with the IPs already stored in BigQuery. Any IP address missing from BigQuery is then added to the table.
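The core of that flow can be sketched in a few lines of Python. The feed URLs below reflect my understanding of where Google and Bing publish their crawler ranges (verify them before relying on this), and the actual BigQuery read/write is left out, since the table name and client setup depend on your own project:

```python
import json
import urllib.request

# Published crawler IP feeds (assumed URLs; check the official docs).
GOOGLEBOT_URL = "https://developers.google.com/search/apis/ipranges/googlebot.json"
BINGBOT_URL = "https://www.bing.com/toolbox/bingbot.json"

def fetch_prefixes(url: str) -> set[str]:
    """Download a crawler IP feed and flatten it to a set of CIDR strings."""
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return {
        p.get("ipv4Prefix") or p.get("ipv6Prefix")
        for p in data.get("prefixes", [])
    }

def new_prefixes(fetched: set[str], already_stored: set[str]) -> set[str]:
    """Cross-reference fetched ranges against what BigQuery already holds,
    returning only the ones that still need to be inserted."""
    return fetched - already_stored
```

In the Cloud Function itself, `already_stored` would come from a simple SELECT against your crawler-IP table, and the diff would be streamed back in with the BigQuery client library; those table and dataset names are yours to choose.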
How To Deploy Your Own
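As a rough sketch of the deployment (the function name, entry point, topic, and schedule below are all placeholders you'd swap for your own), a Pub/Sub-triggered Cloud Function paired with a Cloud Scheduler job looks something like:

```shell
# Deploy the function from the repo's source directory
# (the entry-point name is an assumption).
gcloud functions deploy big-crawler-ips \
  --runtime=python311 \
  --trigger-topic=big-crawler-ips-trigger \
  --entry-point=main \
  --source=.

# Trigger it on a schedule via Pub/Sub
# (every 6 hours here; pick your own cadence).
gcloud scheduler jobs create pubsub big-crawler-ips-schedule \
  --schedule="0 */6 * * *" \
  --topic=big-crawler-ips-trigger \
  --message-body="run"
```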
Hope you find this tool handy, and if you have any feedback, feel free to reach out on Twitter (@JordanChoo).