We’ve prepared a two-part translation of Ryan Sears’s article on working with Google’s Certificate Transparency logs. The first part gives an overview of the structure of the logs and provides sample Python code for parsing their records. The second part is devoted to retrieving all certificates from the available logs and setting up Google BigQuery to store and search the collected data.

Three years have passed since the original was written, and the number of available logs, and with it the number of entries they contain, has grown many times over. It is therefore all the more important to approach log processing properly if the goal is to collect as much data as possible.

Part 1. Parsing Certificate Transparency Logs Like a Boss

During the development of our first project, phisfinder, I spent a lot of time thinking about the anatomy of phishing attacks and about the data sources that would let us spot traces of upcoming phishing campaigns before they could do any real damage.

One of the sources we’ve integrated (and definitely one of the best) is the Certificate Transparency Log (CTL), a project started by Ben Laurie and Adam Langley at Google. Essentially, a CTL is an immutable list of issued certificates stored in a Merkle tree, which allows any certificate in it to be cryptographically verified if necessary.

To understand how much data we will have to deal with, let’s see how many entries are contained in each log listed on the CTL site:

import locale

import requests

locale.setlocale(locale.LC_ALL, 'en_US')

ctl_log = requests.get(
    'https://www.gstatic.com/ct/log_list/log_list.json'
).json()

total_certs = 0

human_format = lambda x: locale.format_string('%d', x, grouping=True)

for log in ctl_log['logs']:
    log_url = log['url']
    try:
        log_info = requests.get(
            'https://{}/ct/v1/get-sth'.format(log_url),
            timeout=3
        ).json()
        total_certs += int(log_info['tree_size'])
    except Exception:
        # Skip logs that are unreachable or time out
        continue

    print("{} has {} certificates".format(
        log_url,
        human_format(log_info['tree_size'])
    ))

print("Total certs -> {}".format(human_format(total_certs)))

The output:

ct.googleapis.com/pilot has 92,224,404 certificates
ct.googleapis.com/aviator has 46,466,472 certificates
ct1.digicert-ct.com/log has 1,577,183 certificates
ct.googleapis.com/rocketeer has 89,391,361 certificates
ct.ws.symantec.com has 3,562,198 certificates
ctlog.api.venafi.com has 94,797 certificates
vega.ws.symantec.com has 200,401 certificates
ctserver.cnnic.cn has 5,081 certificates
ctlog.wosign.com has 1,387,492 certificates
ct.startssl.com has 293,374 certificates
ct.googleapis.com/skydiver has 1,249,079 certificates
ct.googleapis.com/icarus has 48,585,765 certificates
Total certs -> 285,037,607

285,037,607 certificates at the time of writing. That is not an enormous amount of data, but you still have to put in some effort to store and query the certificates efficiently. More on this in the second part.

The anatomy of a CTL

Records are retrieved from a CTL over HTTP, which lets us fetch the data easily with modern libraries. Unfortunately, the records themselves are rather confusing binary structures, which complicates parsing somewhat. An example of a log entry:

// curl -s 'https://ct1.digicert-ct.com/log/ct/v1/get-entries?start=0&end=0' | jq .
{
  "entries": [
    {
      "leaf_input": "AAAAAAFIyfaldAAAAAcDMIIG/zCCBeegAwIBAgI...",
      "extra_data": "AAiJAAS6MIIEtjCCA56gAwIBAgIQDHmpRLCMEZU..."
    }
  ]
}

Each record contains the fields leaf_input and extra_data, both base64-encoded. Referring to RFC 6962, we see that leaf_input is an encoded MerkleTreeLeaf structure, and extra_data is a PrecertChainEntry.
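
To sanity-check that layout by hand before bringing in a full parser, here is a small sketch (my addition, not from the original article) that base64-decodes a leaf_input and unpacks the fixed-size header fields defined by RFC 6962: a one-byte version, a one-byte leaf type, an eight-byte millisecond timestamp and a two-byte entry type, all big-endian.

import base64
import struct

def peek_leaf_header(leaf_input_b64):
    raw = base64.b64decode(leaf_input_b64)
    # MerkleTreeLeaf: Version (1 byte), MerkleLeafType (1 byte), followed by the
    # TimestampedEntry: Timestamp (uint64, ms since epoch), LogEntryType (uint16)
    version, leaf_type, timestamp, entry_type = struct.unpack(">BBQH", raw[:12])
    return {
        "version": version,
        "leaf_type": leaf_type,
        "timestamp_ms": timestamp,
        "entry_type": "x509" if entry_type == 0 else "precert",
    }

The same fields reappear in the Construct definitions below.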

About PreCerts

It took me quite a long time to figure out what a PreCert is all about (you can try it yourself by reading the RFC), and, apparently, I’m not the only one. To save you a lot of thinking and googling, here is the purpose of PreCerts in a nutshell:

PreCerts are a separate type of certificate that a CA issues before issuing the “real” certificate. Essentially, a PreCert is a copy of the original certificate, but it contains a special critical x509 v3 extension called poison. As a result, the certificate will not validate: platforms that recognize the extension know it is a PreCert and treat it accordingly, while platforms that do not recognize the extension must reject it because the extension is marked critical.
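
For illustration, here is a minimal sketch (my addition, not from the original article) that checks a DER-encoded certificate for the CT poison extension, whose OID RFC 6962 fixes as 1.3.6.1.4.1.11129.2.4.3; it assumes a reasonably recent version of the cryptography package.

from cryptography import x509
from cryptography.hazmat.backends import default_backend

# OID of the CT precertificate "poison" extension (RFC 6962, section 3.1)
PRECERT_POISON_OID = x509.ObjectIdentifier("1.3.6.1.4.1.11129.2.4.3")

def is_precert(der_bytes):
    cert = x509.load_der_x509_certificate(der_bytes, default_backend())
    try:
        cert.extensions.get_extension_for_oid(PRECERT_POISON_OID)
        return True
    except x509.ExtensionNotFound:
        return False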

My experience in information security suggests that such a measure is not very effective, if only because bugs in x509/ASN.1 parsing are quite common, and individual implementations may be vulnerable to various shenanigans that would ultimately allow a PreCert to validate. I understand why this was done, but it seems to me that dropping PreCerts entirely and keeping only the certificates actually issued by CAs in the CTL would be much wiser.

Parsing binary structures

As someone who does reverse engineering and takes part in various CTFs from time to time, I’m no stranger to parsing binary structures. Most people turn to the struct module for this sort of thing, but many years ago, while I was working for Phillip Martin, he introduced me to the excellent Construct library, which makes parsing such structures much easier. Below are the structures I used, along with an example of using them to process records:

from construct import \
    Struct, Byte, Int16ub, Int64ub, Enum, Bytes, Int24ub, this, \
    GreedyBytes, GreedyRange, Terminated, Embedded

MerkleTreeHeader = Struct(
    "Version"         / Byte,
    "MerkleLeafType"  / Byte,
    "Timestamp"       / Int64ub,
    "LogEntryType"    / Enum(Int16ub, X509LogEntryType=0, PrecertLogEntryType=1),
    "Entry"           / GreedyBytes
)

Certificate = Struct(
    "Length" / Int24ub,
    "CertData" / Bytes(this.Length)
)

CertificateChain = Struct(
    "ChainLength" / Int24ub,
    "Chain" / GreedyRange(Certificate),
)

PreCertEntry = Struct(
    "LeafCert" / Certificate,
    Embedded(CertificateChain),
    Terminated
)
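
# The structures above are assumed to be saved as ctl_parser_structures.py, which
# the script below imports. Note that they target the older Construct 2.8/2.9 API;
# Embedded was removed in later Construct releases.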
import json
import base64

import ctl_parser_structures

from OpenSSL import crypto

entry = json.loads("""
{
  "entries": [
    {
      "leaf_input": "AAAAAAFIyfaldAAAAAcDMIIG/zCCBeegAwIBAgIQ...",
      "extra_data": "AAiJAAS6MIIEtjCCA56gAwIBAgIQDHmpRLCMEZUg..."
    }
  ]
}
""")['entries'][0]

leaf_cert = ctl_parser_structures.MerkleTreeHeader.parse(
    base64.b64decode(entry['leaf_input'])
)

print("Leaf Timestamp: {}".format(leaf_cert.Timestamp))
print("Entry Type: {}".format(leaf_cert.LogEntryType))

if leaf_cert.LogEntryType == "X509LogEntryType":
    # If the entry is a regular X509 certificate
    cert_data_string = ctl_parser_structures.Certificate.parse(
        leaf_cert.Entry).CertData
    chain = [
        crypto.load_certificate(crypto.FILETYPE_ASN1, cert_data_string)
    ]

    # Parse the `extra_data` structure
    # to get the rest of the chain
    extra_data = ctl_parser_structures.CertificateChain.parse(
        base64.b64decode(entry['extra_data'])
    )
    for cert in extra_data.Chain:
        chain.append(
            crypto.load_certificate(crypto.FILETYPE_ASN1, cert.CertData)
        )
else:
    # If the entry is a PreCert
    extra_data = ctl_parser_structures.PreCertEntry.parse(
        base64.b64decode(entry['extra_data'])
    )
    chain = [
        crypto.load_certificate(
            crypto.FILETYPE_ASN1, extra_data.LeafCert.CertData
        )
    ]

    for cert in extra_data.Chain:
        chain.append(
            crypto.load_certificate(crypto.FILETYPE_ASN1, cert.CertData)
        )

As a result, we get an array of X509 certificates in chain, with the certificate from leaf_input as the first element.

As you can see, Construct makes it pretty easy to define binary structures in Python.
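
Looking ahead to the BigQuery schema in part 2, here is a rough sketch (my addition, not part of the original article or of Axeman) of how fields such as all_dns_names, not_before and not_after could be pulled out of a parsed leaf certificate with pyOpenSSL; the helper name and the exact output format are assumptions.

from datetime import datetime

def extract_fields(cert):
    """Pull the CN, SAN entries and validity window out of a pyOpenSSL X509."""
    names = set()
    if cert.get_subject().CN:
        names.add(cert.get_subject().CN)
    for i in range(cert.get_extension_count()):
        ext = cert.get_extension(i)
        if ext.get_short_name() == b'subjectAltName':
            # str(ext) looks like "DNS:example.com, DNS:www.example.com"
            for part in str(ext).split(','):
                part = part.strip()
                if part.startswith('DNS:'):
                    names.add(part[len('DNS:'):])
    # notBefore/notAfter are ASN.1 time strings such as b'20170101000000Z'
    fmt = '%Y%m%d%H%M%SZ'
    not_before = datetime.strptime(cert.get_notBefore().decode('ascii'), fmt)
    not_after = datetime.strptime(cert.get_notAfter().decode('ascii'), fmt)
    return sorted(names), not_before.timestamp(), not_after.timestamp()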

Now that we understand what a CTL is and how to parse individual records, we can move on to the second part: retrieving and storing all records from the logs so that we can later search across the certificates.

Part 2. Retrieving, Storing and Querying 250M+ Certificates Like a Boss

Collecting certificates

According to the RFC, log entries are retrieved through the get-entries endpoint. Unfortunately, the task is complicated by a limit on the number of records that can be obtained in one request (controlled by the start and end parameters): most logs allow only 64 records to be fetched at a time. Google’s CTLs, however, which account for the majority of all logs, accept requests of up to 1024 entries.
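
As a minimal sketch of what that pagination looks like (my addition, not Axeman’s actual code), the following generator walks a log in fixed-size blocks via get-entries, using the conservative 64-entry block size mentioned above.

import requests

def iter_log_entries(log_url, tree_size, block_size=64):
    """Yield raw get-entries records from a CT log in blocks of block_size."""
    start = 0
    while start < tree_size:
        end = min(start + block_size, tree_size) - 1
        resp = requests.get(
            'https://{}/ct/v1/get-entries'.format(log_url),
            params={'start': start, 'end': end},
            timeout=10,
        )
        resp.raise_for_status()
        entries = resp.json()['entries']
        if not entries:
            break
        for entry in entries:
            yield entry
        # A log may return fewer entries than requested, so advance by what we got
        start += len(entries)

Combined with the tree_size obtained from get-sth in part 1, this is enough to walk an entire log, just very slowly.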

Since the task is simultaneously IO-bound (fetching records over HTTP) and CPU-bound (parsing certificates), efficient processing requires combining asynchronous IO with multiprocessing.
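
One way to combine the two (a sketch of the general pattern, assuming aiohttp as the HTTP client; this is not Axeman’s actual implementation) is to fetch blocks asynchronously and hand the CPU-heavy parsing off to a process pool:

import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp

def parse_block(entries):
    # CPU-bound part: decode and parse the certificates in one block of entries,
    # e.g. with the Construct structures from part 1 (stubbed out here).
    return len(entries)

async def fetch_block(session, log_url, start, end):
    url = 'https://{}/ct/v1/get-entries'.format(log_url)
    async with session.get(url, params={'start': start, 'end': end}) as resp:
        return (await resp.json())['entries']

async def process_log(log_url, tree_size, block_size=64):
    loop = asyncio.get_running_loop()
    parsed = 0
    with ProcessPoolExecutor() as pool:
        async with aiohttp.ClientSession() as session:
            for start in range(0, tree_size, block_size):
                end = min(start + block_size, tree_size) - 1
                entries = await fetch_block(session, log_url, start, end)
                # Hand parsing to another process so the event loop stays free
                parsed += await loop.run_in_executor(pool, parse_block, entries)
    return parsed

A real tool would also keep several downloads in flight at once; this only shows how the IO and CPU halves are kept apart.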

Since there were no tools that would let you easily and painlessly fetch and parse all CTLs (apart from a not particularly remarkable utility from Google), it was decided to spend a little time writing a tool that would cover all our needs. The result was Axeman, which uses asyncio and the wonderful aioprocessing library to download, parse and save certificates to a set of CSV files, limited only by the speed of the Internet connection.

Moving to the cloud

After spinning up an instance (translator’s note: that is what VMs are called in Google Cloud) with 16 cores, 32 GB of memory and a 750 GB SSD (thanks to Google for the free $300 credit for new accounts!), I launched Axeman, which downloaded all the certificates in less than a day and saved the results to /tmp/certificates/$CTL_DOMAIN/

Where to store all this data?

Initially, Postgres was chosen to store and search the data, and although I have no doubt that with the right schema Postgres could easily handle 250 million records (unlike my first attempt, in which a single query took about 20 minutes!), I started looking for solutions that:

  • allow cheap storage of large amounts of data
  • provide quick search
  • allow easy data updates

There were several options, but in terms of cost almost all of the candidates (AWS RDS, Heroku Postgres, Google Cloud SQL) turned out to be very expensive. Fortunately, since our data essentially never changes, we have extra flexibility in choosing a platform to host it.

In general, this is exactly the kind of search that fits a map/reduce model perfectly, using, say, Spark or Hadoop Pig. Looking through the offerings of various providers in the “big data” category (even though our dataset is clearly too small to really qualify), I came across Google BigQuery, which ticks all the boxes above.

Feeding the data to BigQuery

Loading data into BigQuery is pretty easy thanks to Google’s gsutil utility. We create a new bucket for our certificates:
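
(The original illustrates this step with a screenshot; a command-line equivalent would be something along these lines, with the bucket name taken from the upload command below.)

gsutil mb gs://all-certificates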

When the bucket is ready, we use gsutil to ship all the certificates to Google’s storage (and from there into BigQuery). After setting up the account with the gsutil config command, we start the upload:

gsutil -o GSUtil:parallel_composite_upload_threshold=150M \
       -m cp \
       /tmp/certificates/* \
       gs://all-certificates

And we see the following result in our bucket:

Next, we create a new dataset in BigQuery:

Now we can import data from the storage bucket into our new dataset. Unfortunately, BigQuery does not have a “please import all folders recursively” button, so each CTL has to be imported separately, but it doesn’t take that long. We create a table and import our first log (pay special attention to the marked settings):

Since the schema is needed every time the next log is imported, it’s easiest to use the “Edit as Text” option. The schema used:

[
    {
        "name": "url",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "mode": "REQUIRED",
        "name": "cert_index",
        "type": "INTEGER"
    },
    {
        "mode": "REQUIRED",
        "name": "chain_hash",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "cert_der",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "all_dns_names",
        "type": "STRING"
    },
    {
        "mode": "REQUIRED",
        "name": "not_before",
        "type": "FLOAT"
    },
    {
        "mode": "REQUIRED",
        "name": "not_after",
        "type": "FLOAT"
    }
]
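
If you would rather script the imports than click through the UI for each log, a rough equivalent with the bq command-line tool might look like the sketch below. The dataset and table names (certificate_data.scan_data) are the ones used in the queries later on, schema.json is assumed to hold the schema above, and --max_bad_records mirrors the “maximum number of errors” setting mentioned below:

for folder in $(gsutil ls gs://all-certificates/); do
  bq load --source_format=CSV --max_bad_records=100 \
     certificate_data.scan_data "${folder}*" ./schema.json
done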

Either way, we repeat the process for each log. Make sure each import succeeds (errors can usually be ignored; just make sure you set an adequate value for the maximum number of errors). As a result, you should end up with a dataset looking something like this:

What happened in the end

Now it’s time to reap the benefits of our labors and put the system through various queries.

Lately there has been a lot of talk about punycode domains and the homoglyph attacks associated with them. Let’s try the following query:

SELECT
  all_dns_names
FROM
  [ctl-lists:certificate_data.scan_data]
WHERE
  (REGEXP_MATCH(all_dns_names,r'\b?xn\-\-'))
  AND NOT all_dns_names CONTAINS 'cloudflare'

And in just 15 seconds we get the result with all punycode domains from all known CTLs!

Let’s look at another example and try to get all certificates for Coinbase domains recorded in Certificate Transparency:

SELECT
  all_dns_names
FROM
  [ctl-lists:certificate_data.scan_data]
WHERE
  (REGEXP_MATCH(all_dns_names,r'.*\.coinbase.com[\s$]?'))

In just two seconds, we get all the results we are interested in:

The ability to easily perform this kind of analysis on such a large dataset is a powerful tool for identifying various trends that would otherwise not be possible to detect.

A small riddle

While doing my research, I discovered something strange. The domain flowers-to-the-world.com constantly appeared in various logs. Almost every log had a huge number of certificates containing this domain:

SELECT
  url,
  COUNT(*) AS total_certs
FROM
  [ctl-lists:certificate_data.scan_data]
WHERE
  (REGEXP_MATCH(all_dns_names,r'.*flowers-to-the-world.*'))
GROUP BY
  url
ORDER BY
  total_certs DESC

Whois shows that this domain belongs to Google, so I wonder whether it is part of some testing routine. If you are a Google engineer and can ask your Certificate Transparency colleagues about it, I would be very interested to hear the story. (Translator’s note: a Google engineer answered in the comments below the original post.)


Very soon we are planning to release our own product, NetLas.io. It is a kind of technical atlas of the entire Internet that will include not only certificates, but also data on domains and subdomains, server responses on popular ports, and much other information useful to security researchers.

In Russia, as far as we know, this is the first such product. We have strong competitors in the US and China, but we hope to outperform them in some ways. For example, the relevance of data – already now, our implementation of the search engine allows us to include data in the search results, from scans no older than a minute. Today Netlas.io is available in “early access” format. If you want to test – go to the site and register.