Automated Data Scraping with GitHub Actions

Data Scraping without a Database

Dec 2020 Edit: You can see a live example of this in my own GitHub profile README.

Nov 2023 Edit: GitHub Actions contain a lot of footguns. Be aware of them all, and move YAML complexity into code.

A common need I have in open source community work, especially with static site generators and the JAMstack, is scraping and updating data. For example, on the Svelte Community site we scrape the GitHub star count and last update, and Gatsby Starters does the same. Of course, you could grab data clientside, and for whatever you can’t do clientside, you can stand up a serverless function.

But sometimes it just makes sense to scrape data once instead of every time your users access your site, especially if that data requires tokens your users may not have. Typically you’d set up a cronjob and send the data into a database somewhere. With GitHub Actions, you can do this all inside GitHub, AND save a version controlled history of all data.

I noticed Mikeal Rogers doing exactly this for his Daily OSS watcher project, so I finally took some time to check out his code and make a minimal repro that others can use as a base.

Demo

You can see my demo in action here: https://github.com/sw-yx/gh-action-data-scraping.

For those new to npm, there is a simple npm script defined in package.json. This is so you can run it manually while writing and testing your code. The action workflow calls this exact same npm script (npm run action), so what you test locally is what runs on GitHub.

The Script

Straight to the point:

on:
  schedule:
    - cron:  '0 8 * * *' # 8am daily. Ref https://crontab.guru/examples.html
name: Scrape Data
jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@master # check out the repo this action is in, so that you have all prior data
    - name: Build
      run: npm install # any dependencies you may need
    - name: Scrape
      run: npm run action # actually run your npm script for scraping
      # env:
      #   WHATEVER_TOKEN: ${{ secrets.YOU_WANT }}
    - uses: mikeal/publish-to-github-action@master
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # GitHub sets this for you

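The basic idea, in English, is:

  • On a schedule (here, 8am UTC daily), start a fresh Ubuntu runner.
  • Check out the repo, so that you have all prior data.
  • Install dependencies and run the npm script that does the scraping.
  • Commit and push whatever changed back to the repo, using the GITHUB_TOKEN that GitHub sets for you.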

That’s it! Look ma, no database!
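Since the data lives in the repo, anything that can read files can consume it. As a rough sketch, assuming the scraper commits one file per day named YYYY-MM-DD.json into a folder (a naming convention invented for this example, the demo repo may organize things differently), a site build could pick up the newest snapshot like this:

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Return the newest YYYY-MM-DD.json snapshot in `dir` as { date, rows },
// or null when nothing has been committed yet.
function loadLatestSnapshot(dir) {
  if (!fs.existsSync(dir)) return null;
  const files = fs
    .readdirSync(dir)
    .filter((name) => /^\d{4}-\d{2}-\d{2}\.json$/.test(name))
    .sort(); // ISO dates sort chronologically as plain strings
  if (files.length === 0) return null;
  const latest = files[files.length - 1];
  const rows = JSON.parse(fs.readFileSync(path.join(dir, latest), "utf8"));
  return { date: latest.replace(/\.json$/, ""), rows };
}
```

Older snapshots stay available both as files and through git log, which is the "version controlled history" part of the idea.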

As part of your workflow, you can also fire off a static site build after this action completes, or weekly, or whenever else you like.
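One way to fire off that build, sketched under assumptions: many static hosts expose a "build hook" URL that starts a deploy when it receives a POST request. The function name triggerBuild and the BUILD_HOOK_URL variable below are hypothetical, and your host may expect a different request shape, so check its docs.

```javascript
// Send an empty POST to a build hook URL and fail loudly if the host rejects it.
// `fetchImpl` is injectable so the function can be tried without network access.
async function triggerBuild(hookUrl, fetchImpl = globalThis.fetch) {
  if (!hookUrl) throw new Error("No build hook URL configured");
  const res = await fetchImpl(hookUrl, { method: "POST" });
  if (!res.ok) throw new Error(`Build hook responded with ${res.status}`);
  return res.status;
}
```

You could call triggerBuild(process.env.BUILD_HOOK_URL) at the end of the scraping script, storing the URL as a repository secret and passing it in through the env block of the Scrape step.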

Limits

You can do whatever you like with this, including taking screenshots of sites!

The limits I can think of are the limits of GitHub and GitHub Actions:

In addition to these limits, GitHub Actions should not be used for:

  • Content or activity that is illegal or otherwise prohibited by their Terms of Service or Community Guidelines.
  • Cryptomining
  • Serverless computing
  • Activity that compromises GitHub users or GitHub services.
  • Any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used. In other words, be cool, don’t use GitHub Actions in ways you know you shouldn’t.

Be a good citizen, don’t abuse it and F this up for the rest of us!

More

I’m looking for more great usecases for GH actions:

