A great source of data for any SEO professional is the search engine crawl data found in log files. You can analyze log files to learn exactly how search engines crawl and interpret your website, giving you insight that no third-party SEO tool can. If you want to save both time and money, try automating the parsing, validation, and pivoting of log file data for SEO with Python. In this article, we will walk you through the steps for parsing and pivoting SEO log files with Python.
Why You Should Use Python to Parse and Pivot SEO Files
Python is a multipurpose programming language that has a wide variety of applications when it comes to data, web development, and executing algorithms. Using Python to parse and pivot your SEO data from log files helps you to:
- Validate your conclusions by giving you concrete evidence of how search engines are crawling and seeing your website
- Prioritize your findings by helping you see the scale of a problem and how much fixing it can help
- Find any other problems that you can’t see in other data sources
Even though there are several benefits of using log file data, many SEO experts stay away from it for a variety of reasons. For one, finding the data typically means going through a dev team, which can take a lot of time. Also, the raw files can be large and hard to understand, which makes it difficult to parse the data. Finally, the cost of tools designed to make the process simpler might be too high.
While these are all valid concerns, there is another way. If you have some knowledge of coding and scripting languages, you can automate the process. We will walk you through the steps for how to use Python to analyze server logs for SEO.
How to Use Python to Parse and Pivot SEO Files
Before you get started, you will need to consider which format you want to use to parse log file data. You have several options, like Apache, Nginx, and IIS. Plus, many websites use CDN providers now, like Cloudflare, Cloudfront, and Akamai, to serve up content from the closest edge location to a user.
In this article, we will focus on the Combined Log Format. That’s because the Combined Log Format is the default for Nginx and typically the choice on Apache servers.
If you don’t know what kind of format you are dealing with, there are services like Builtwith and Wappalyzer that can tell you about a website’s tech stack.
If you’re still unsure, simply open one of the raw files and cross-reference its fields against the documentation for each format.
You will also need to consider which search engine you will want to include. In this article, we will focus on Google, because Google is the most dominant search engine with more than 92% of the global market share.
1. Identify Files and Determine Formats
In order to perform a meaningful SEO analysis, you will need a minimum of about 100,000 requests and between two and four weeks’ worth of data for a typical website. Because of their size, logs are typically split into individual days, so you will most likely receive multiple files to process.
Since you don’t know how many files you’ll be dealing with unless you combine them before running the script, the first step is to generate a list of all of the files in your folder using the glob module. This returns any file matching a pattern that you specify. The following code will match any TXT file:
import glob

files = glob.glob('*.txt')
However, not all files are TXT. Log files can come in multiple kinds of file formats. You might not even recognize the file extension.
It’s also possible that the files you receive will be split across multiple subfolders, and we don’t want to waste time copying them into a single location. Luckily, glob supports both recursive searches and wildcard operators. That means you can generate a list of all the files within a subfolder or child subfolders.
files = glob.glob('**/*.*', recursive=True)
Next, you want to identify which types of files are within your list. To do this, the MIME type of the specific file can be detected. This tells you what kind of file you’re dealing with, no matter the extension.
You can do this by using python-magic, a wrapper around the libmagic C library, and creating a simple function.
pip install python-magic
pip install libmagic
import magic

def file_type(file_path):
    # Return the MIME type, e.g. 'text/plain', regardless of extension
    return magic.from_file(file_path, mime=True)
Next, you can use list comprehension to loop through your files and apply the function, creating a dictionary to store both the names and types.
file_types = [file_type(file) for file in files]
file_dict = dict(zip(files, file_types))
Finally, extract the list of files that return a MIME type of text/plain, excluding anything else. A list comprehension handles this in one line:

uncompressed = [file for file, mime in file_dict.items() if mime == 'text/plain']
2. Extract Search Engine Requests
After filtering down the files in your folder or folders, your next step is to filter the files themselves by only extracting the requests that you care about.
This eliminates the need to filter the files using command-line utilities like grep or findstr, and saves you time searching for the right command.
In this case, since you only want Googlebot requests, you will search for “Googlebot” to match all of the relevant user agents.
You can use Python’s open function to read and/or write your file and Python’s regex module, RE, to perform the search.
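As a sketch, the filtering step might look like the following. The `extract_googlebot_lines` helper and the file list it consumes are illustrative assumptions, not a fixed recipe:

```python
import re

# Case-insensitive pattern matching Googlebot user agents
GOOGLEBOT = re.compile(r'googlebot', re.IGNORECASE)

def extract_googlebot_lines(paths):
    """Collect every log line whose user agent mentions Googlebot."""
    matches = []
    for path in paths:
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                if GOOGLEBOT.search(line):
                    matches.append(line)
    return matches
```

You would pass in the `uncompressed` list built in step one and write the matches out to a single file for the next step.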
3. Parse Requests
There are multiple ways you can parse requests. For the sake of this article, we will use Pandas’ inbuilt CSV parser and some basic data processing functions in order to:
- Drop unnecessary columns
- Format the timestamp
- Create a column with full URLs
- Rename and reorder the remaining columns
Instead of hardcoding a domain name, you can use the input function to prompt the user and save it as a variable.
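A minimal sketch of that parsing step, assuming the Combined Log Format described above; the `parse_logs` helper and its column names are illustrative, not a standard:

```python
import pandas as pd

def parse_logs(path, domain):
    """Parse a Combined Log Format file into a tidy DataFrame."""
    df = pd.read_csv(
        path,
        sep=' ',
        quotechar='"',
        header=None,
        names=['IP', 'Identity', 'User', 'Timestamp', 'Offset',
               'Request', 'Status Code', 'Bytes', 'Referer', 'User Agent'],
    )
    # Drop unnecessary columns
    df = df.drop(columns=['Identity', 'User', 'Offset'])
    # Format the timestamp: strip the leading '[' left over from the split
    df['Timestamp'] = pd.to_datetime(df['Timestamp'].str.lstrip('['),
                                     format='%d/%b/%Y:%H:%M:%S')
    # Create a column with full URLs from request lines like "GET /path HTTP/1.1"
    df['URL'] = domain + df['Request'].str.split().str[1]
    # Reorder the remaining columns
    return df[['Timestamp', 'IP', 'URL', 'Status Code', 'Bytes',
               'Referer', 'User Agent']]
```

In a script, the domain argument could come from input() rather than being hardcoded, as described above.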
4. Validate Requests
It is very simple to spoof search engine user agents, which makes request validation a vital part of the process: it stops you from drawing incorrect conclusions from spoofed or third-party crawls that merely impersonate Googlebot.
In order to do this, you need to install a library called dnspython and perform a reverse DNS lookup on each IP address. You can use pandas to drop duplicate IPs and run the lookups on the smaller DataFrame. Then, reapply the results and filter out any invalid requests.
This approach greatly increases the speed of the lookups and validates millions of requests in just minutes.
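The sketch below uses the standard library’s socket module in place of dnspython for brevity; the helper names and the `IP` column are assumptions carried over from the parsing step. Genuine Googlebot hosts resolve under googlebot.com or google.com:

```python
import socket

import pandas as pd

VALID_SUFFIXES = ('.googlebot.com', '.google.com')

def reverse_dns(ip):
    """Reverse-resolve an IP address; return '' if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ''

def is_google_host(hostname):
    """Check the resolved hostname against Google's crawl domains."""
    return hostname.endswith(VALID_SUFFIXES)

def validate(df, resolver=reverse_dns):
    """Keep only rows whose IP reverse-resolves to a Google hostname."""
    # Resolve each unique IP once, then map the results back onto the frame
    hostnames = {ip: resolver(ip) for ip in df['IP'].unique()}
    out = df.copy()
    out['Hostname'] = out['IP'].map(hostnames)
    return out[out['Hostname'].apply(is_google_host)]
```

A production version should also forward-resolve the returned hostname and confirm it matches the original IP, per Google’s verification guidance.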
5. Pivot the Data
After validation, you have a cleansed and easy to understand set of data. You can begin pivoting this data to analyze points of interest more easily.
You can start with simple aggregation using Pandas’ groupby and agg functions to perform a count of the number of requests for different status codes.
status_code = logs_filtered.groupby('Status Code').agg('size')
In order to replicate the type of count you use in Excel, you need to specify an aggregate function of ‘size,’ not ‘count’. If you use count, you will invoke the function on all columns within the DataFrame, and null values are handled differently. Resetting the index will restore the headers for both columns, and the latter column can be renamed to something more relevant.
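Putting those pieces together on a toy DataFrame (the `logs_filtered` data and column names here are stand-ins from the parsing step):

```python
import pandas as pd

# Toy stand-in for the validated log DataFrame
logs_filtered = pd.DataFrame({'Status Code': [200, 200, 404, 301, 200]})

status_code = (
    logs_filtered.groupby('Status Code')
    .agg('size')
    .reset_index()                     # restore 'Status Code' as a column
    .rename(columns={0: 'Requests'})   # the unnamed count column arrives as 0
)
```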
If you want more advanced data manipulation, Pandas’ inbuilt pivot tables offer functionality comparable to Excel, which makes complex aggregations possible with just a single line of code. At its most basic level, the function requires a specified DataFrame and index (or indexes if a multi-index is required) and returns the corresponding values.
For greater specificity, the required values can be declared and aggregations (sum, mean, etc.) can be applied using the aggfunc parameter. The columns parameter can also help you display values horizontally for clearer output.
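As a sketch of those parameters on the same toy data (column names are assumptions, not part of any log standard):

```python
import pandas as pd

logs_filtered = pd.DataFrame({
    'URL': ['/a', '/a', '/b', '/b'],
    'Status Code': [200, 404, 200, 200],
    'Bytes': [1000, 500, 2000, 1500],
})

# Total bytes per URL, broken out horizontally by status code
pivot = pd.pivot_table(
    logs_filtered,
    index='URL',
    values='Bytes',
    columns='Status Code',
    aggfunc='sum',
    fill_value=0,   # show 0 instead of NaN for missing combinations
)
```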
For data points like bytes, which might have many different numerical values, you will want to bucket the data. In order to do so, define your intervals within a list and then use the cut function to sort the values into bins, specifying np.inf to catch anything above the max value declared.
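For example, with interval edges chosen purely for illustration:

```python
import numpy as np
import pandas as pd

bytes_series = pd.Series([120, 950, 15_000, 2_400_000])

# Interval edges; np.inf catches anything above the largest declared value
bins = [0, 1_000, 100_000, 1_000_000, np.inf]
labels = ['<1 KB', '1 KB-100 KB', '100 KB-1 MB', '>1 MB']

buckets = pd.cut(bytes_series, bins=bins, labels=labels)
```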
Finally, you need to export your log data and pivots. In order to make it easier to analyze, you will want to export your data to an Excel file rather than CSV. XLSX files support multiple sheets, which means you can combine all the DataFrames in one file. You can achieve this using to_excel. You will need to specify an ExcelWriter object because you are adding more than one sheet to the same workbook. Plus, when you are exporting a large number of pivots, it helps to simplify things by storing DataFrames and sheet names in a dictionary and using a for loop.
Keep in mind that Excel’s row limit is 1,048,576. Since you are exporting every request, this might cause issues if you have large samples. CSV files have no limit, so an if statement can be employed to add in a CSV export as a fallback. Then, if the length of the log file DataFrame is more than 1,048,576, it will be exported as a CSV. This prevents the script from failing and still combines the pivots into a single export.
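The export step, with the CSV fallback, can be sketched as follows; the `export_logs` helper is an illustrative name, and writing XLSX files assumes an Excel engine such as openpyxl is installed:

```python
import pandas as pd

EXCEL_ROW_LIMIT = 1_048_576  # maximum rows per Excel worksheet

def export_logs(frames, path='logs.xlsx'):
    """Write each DataFrame to its own sheet; fall back to CSV when too large."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, df in frames.items():
            if len(df) >= EXCEL_ROW_LIMIT:
                # Oversized frames would fail in Excel, so export them as CSV
                df.to_csv(f'{sheet_name}.csv', index=False)
            else:
                df.to_excel(writer, sheet_name=sheet_name, index=False)
```

The `frames` argument is the dictionary of sheet names to DataFrames described above, so adding another pivot to the export is just another dictionary entry.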
Parsing and Pivoting SEO Files with SEO Design Chicago
Though using Python to parse and pivot SEO files is a useful skill, it is also a difficult one to master. If you need assistance with your SEO files, contact the SEO data experts at SEO Design Chicago today for assistance!