
Better Meta

A 3D representation of a Google web crawler inspecting web pages with a magnifying glass

I have another website that I manage with a CMS (Content Management System). I hadn't bothered setting up anything for Google or Bing indexing on that site yet; I was waiting to see if I would keep using it first. But the content I was putting on the site was getting indexed anyway, and it looked good in the search results too. So I decided to investigate why.

SEO. Isn't it always the answer? Hands up, I'm not an expert on it; it's quite a broad subject, and I prefer playing in a terminal. But if a website can get itself indexed on its own, to a high standard and without any human hands getting involved, then it's worth knowing about.

NOTE: Keep in mind that this relies on web crawler bots visiting your site. So while it automates page indexing, if you want things done fast you may still want to use other methods.

Robots

Let's start with the simplest thing first: the robots.txt file. I had only ever read about this file being used to limit web crawler activity on a website. It seems obvious in hindsight, but if it can tell robots to go away, it can also tell them to come in.

User-agent: *
Disallow:
Sitemap: https://aaronwatts.dev/sitemap.xml

There we go: all crawlers welcome, and here's my site directory. I had previously been using an XSL template to style my sitemap, but that has since broken in Chromium, which is in the process of removing XSL support from the Blink engine. I'm not sure whether that is likely to affect certain web crawlers too, but to be safe I have removed the XSL stylesheet instruction from the XML sitemap file.
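For reference, a plain sitemap with no stylesheet instruction needs nothing more than the urlset itself; the entry below is illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://aaronwatts.dev/guides/better-meta</loc>
    <lastmod>2025-03-05</lastmod>
  </url>
</urlset>
```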

OpenGraph

The OpenGraph protocol is a standard that allows relevant information to be picked out by social media applications to generate thumbnails and link previews. It's not dissimilar to the standard meta description tag, but richer.

Making use of it is just a case of adding the necessary meta elements to your document head.

<html>
  <head>
    <meta property="og:title" content="Better Meta"/>
    <meta property="og:site_name" content="aaronwatts@dev"/>
    <meta property="og:description" content="intro text here"/>
    <meta property="og:url" content="https://aaronwatts.dev/guides/better-meta"/>
    <meta property="og:image" content="https://aaronwatts.dev/images/guides/better-meta.avif"/>
    <meta property="og:type" content="article"/>
    <meta property="og:article:published_time" content="2025-03-05"/>
  </head>
  ...
</html>

The content required for these tags already exists somewhere within the document itself, so the whole thing can be done programmatically, which I will look at shortly. Automating the process also reduces the risk of breaking something with a typo.

Twitter has its own card tags, which can fall back on OpenGraph properties; I am not currently implementing them here, but will do soon. I have used this page as an example, which counts as an article, but the schema is vast, and it's worth consulting the documentation for the kinds of clues you can give other sites about what type of content they are linking to.
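For completeness, a minimal set of Twitter card tags, sketched here rather than taken from my current setup, would mirror the OpenGraph values above using twitter: names:

```html
<meta name="twitter:card" content="summary_large_image"/>
<meta name="twitter:title" content="Better Meta"/>
<meta name="twitter:description" content="intro text here"/>
<meta name="twitter:image" content="https://aaronwatts.dev/images/guides/better-meta.avif"/>
```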

Schema

Similar to OpenGraph, there are also rich content identifiers aimed specifically at search engines. Google, along with other search engines, uses the schema.org vocabulary.

The easiest way to work with schema is JSON-LD (JSON for Linked Data). Some schema classifications have required attributes, and Google's developer documentation has helpful guidance on what ought to be included and what is optional.

<html>
  <head>
    <script type="application/ld+json">{
      "@context": "http://schema.org",
      "@type": "Article",
      "mainEntityOfPage": {
        "@type": "WebPage",
        "@id": "https://aaronwatts.dev/guides/better-meta"
      },
      "headline": "Better Meta",
      "datePublished": "2025-03-05",
      "description": "intro text here",
      "image": [
        "https://aaronwatts.dev/images/guides/better-meta.avif"
      ],
      "author": {
        "@type": "Person",
        "name": "aaronwatts@dev"
      }
    }</script>
  </head>
  ...
</html>

To implement a schema using JSON-LD, you just need a script tag in the head of the document with a type attribute of application/ld+json. As with the OpenGraph content, all the required data already exists somewhere within the document, so it can be populated programmatically.
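One typo in that JSON and the whole block is silently useless to crawlers, so it's worth round-tripping it through a parser. A quick stdlib-only sanity check might look like this sketch (the regex extraction is a simplification; a real pipeline would use its HTML parser):

```python
import json
import re

def validate_json_ld(html):
    """Pull the first JSON-LD script block out of a page and parse it.

    Raises ValueError if the tag is missing or the JSON is malformed,
    which catches typos before a crawler ever sees them.
    """
    match = re.search(
        r'<script type="application/ld\+json">(.*?)</script>',
        html, re.DOTALL)
    if match is None:
        raise ValueError("no JSON-LD script tag found")
    return json.loads(match.group(1))

page = '<head><script type="application/ld+json">{"@type": "Article"}</script></head>'
print(validate_json_ld(page)["@type"])  # Article
```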

Automating the Process

If you are using a CMS, this is likely being handled for you. But if not, it can be automated. How you achieve this will depend on the tech stack you use for a given website, and your preferred or available technologies. A PHP-like templating engine would be the simplest way to automate it. For my use case, I had already written a CMS module in Python, so I implemented the automation there. Without looking too deeply into my CMS module, I'll outline how I have automated updating the meta for my articles using Python.

<head>
  <meta charset="UTF-8"/>
  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
  <meta name="description"/>
  ...
  <meta property="og:title"/>
  <meta property="og:site_name" content="aaronwatts@dev"/>
  <meta property="og:description"/>
  <meta property="og:url"/>
  <meta property="og:image"/>
  <meta property="og:type" content="article"/>
  <meta property="og:article:published_time"/>

  <script type="application/ld+json"></script>
</head>

Before I even run any Python code, the HTML documents for my articles already contain the necessary OpenGraph meta tags and an empty script tag with a type of application/ld+json. Anything that is page specific has no data applied to its content attribute yet. I am also programmatically populating the content attribute of the standard meta description, for consistency. When I run my Python CMS module, it first runs a series of tests on each document to make sure the tags it expects to work with are all present. If anything is missing, or other tests fail, it exits the program and logs which tests failed.
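A minimal version of that pre-flight check could look like the following sketch with BeautifulSoup; the selector list here is illustrative rather than my actual config:

```python
from bs4 import BeautifulSoup

# Illustrative list -- the real set would live in a config file
REQUIRED_SELECTORS = [
    "meta[name='description']",
    "meta[property='og:title']",
    "meta[property='og:description']",
    "meta[property='og:url']",
    "meta[property='og:image']",
    "script[type='application/ld+json']",
]

def missing_tags(soup):
    """Return the selectors that match nothing in the document head."""
    head = soup.select_one("head")
    if head is None:
        return list(REQUIRED_SELECTORS)
    return [sel for sel in REQUIRED_SELECTORS if head.select_one(sel) is None]

doc = BeautifulSoup("<head><meta property='og:title'/></head>", "html.parser")
print(missing_tags(doc))  # everything except og:title
```

The caller can then log the returned list and bail out if it is non-empty.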

def extract_data(doc):
  # CSS selectors come from a config file; soup is the parsed document
  doc['title'] = doc['soup'].select_one(SELECTORS['title']).text
  doc['intro'] = doc['soup'].select_one(SELECTORS['intro']).text
  doc['description'] = " ".join(doc['intro'].split())  # collapse whitespace
  doc['html_date'] = doc['soup'].select_one(SELECTORS['time'])
  doc['date_attr'] = doc['html_date']['datetime']
  doc['URL'] = path_to_url(f"{doc['directory']}/{doc['filename']}")
  doc['IMG'] = BASE_URL + selector(doc['soup'], "article img", "src")
  return doc

My CMS already iterates through the documents to extract the data needed to create the RSS feeds and add new articles to the index pages, so all the data I need is readily available. There are custom functions here to build a URL from a specified file path, and a selector function to get an attribute's content (I am not using that function to get the datetime attribute because I also use html_date to format dates to RFC 822 for the RSS feed). My CSS selectors are all pulled out of a config file, so if I decide to change my document structure I just need to alter the config file, not the code itself. The doc being passed into the function is a dictionary for each article document, with keys for filename, directory, and soup.

NOTE: I've only included the relevant extractions in the function above. More data gets extracted from each document, but it is not relevant to building the meta for the page and is used elsewhere instead, such as in the RSS feeds.
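The two helpers referenced above aren't shown in this article; a rough sketch of what they might look like follows, with BASE_URL and the exact path handling being assumptions on my part:

```python
BASE_URL = "https://aaronwatts.dev"  # assumed value

def path_to_url(path):
    """Build an absolute URL from a file path relative to the site root.

    A real implementation might also strip a trailing .html extension;
    this sketch only handles the join.
    """
    return f"{BASE_URL}/{path.lstrip('/')}"

def selector(soup, css, attr=None):
    """Return the first element matching a CSS selector, or one of its
    attributes if attr is given."""
    element = soup.select_one(css)
    if attr is not None:
        return element[attr]
    return element
```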

def add_meta_content(article):
  head = selector(article['soup'], "head")

  description = selector(head, "meta[name='description']")
  description['content'] = article['description']

  og_title = selector(head, "meta[property='og:title']")
  og_title['content'] = article['title']
  og_desc = selector(head, "meta[property='og:description']")
  og_desc['content'] = article['description']
  og_url = selector(head, "meta[property='og:url']")
  og_url['content'] = article['URL']
  og_image = selector(head, "meta[property='og:image']")
  og_image['content'] = article['IMG']
  og_time = selector(head, "meta[property='og:article:published_time']")
  og_time['content'] = article['date_attr']

  json_ld = selector(head, "script[type='application/ld+json']")
  json_ld_data = {
    "@context": "http://schema.org",
    "@type": "Article",
    "mainEntityOfPage": {
      "@type": "WebPage",
      "@id": article['URL']
    },
    "headline": article['title'],
    "datePublished": article['date_attr'],
    "description": article['description'],
    "image": [
      article['IMG']
    ],
    "author": {
      "@type": "Person",
      "name": "aaronwatts@dev"
    }
  }

  json_ld_str = json.dumps(json_ld_data, indent=2)
  json_ld.string = json_ld_str

The extended article data is then passed into a function that populates the required meta tags with the applicable data. The Python json module is used to convert a Python dictionary into a JSON string.

article_path = f"{ROOT}/{article['directory']}/{article['filename']}"
write_to_html(article['soup'], article_path)

def write_to_html(soup, path):
  with open(path, "w") as outf:
    outf.write(str(soup))

Each document's soup retains all adjustments made, so the final step is simply to write the modified soup back into the file it has been taken from.