GitHub Source Connector for Confluent Platform

The Kafka Connect GitHub Source Connector writes metadata from GitHub to Apache Kafka® topics, either detecting changes in real time or consuming the history. The connector polls data from GitHub through the GitHub APIs, converts the data into Kafka records, and then pushes the records into a Kafka topic. Each record from GitHub is converted into exactly one Kafka record.

Features

The GitHub Source connector offers the following features:

  • At Least Once Delivery: The connector guarantees that records from GitHub are delivered at least once to the Kafka topic.
  • API Rate Limit Awareness: The connector stops fetching records from GitHub when the API rate limit is exceeded. Once the API rate limit resets, the connector resumes fetching records.
  • Supports HTTPS Proxy: The connector can connect to GitHub using an HTTPS proxy server. To configure the proxy, set http.proxy.host, http.proxy.port, http.proxy.user, and http.proxy.password in the configuration file. The connector has been tested with an HTTPS proxy that uses basic authentication.
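
As an illustration of the proxy settings named above, the following Python sketch merges them into a connector configuration dictionary. Only the four http.proxy.* property names come from this page; the helper name with_proxy and the example host, port, and credentials are hypothetical.

```python
def with_proxy(config, host, port, user=None, password=None):
    """Return a copy of a connector config with HTTPS proxy settings added.

    Only the four http.proxy.* keys come from the connector documentation;
    this function is an illustrative helper, not part of the connector.
    """
    updated = dict(config)  # leave the caller's dictionary untouched
    updated["http.proxy.host"] = host
    # Values are written as strings, matching the examples on this page.
    updated["http.proxy.port"] = str(port)
    # Credentials are only needed when the proxy uses basic authentication.
    if user is not None and password is not None:
        updated["http.proxy.user"] = user
        updated["http.proxy.password"] = password
    return updated


# Hypothetical proxy host, port, and credentials.
base = {"name": "MyGithubConnector", "github.repositories": "apache/kafka"}
proxied = with_proxy(base, "proxy.example.com", 3128, "proxyuser", "proxypass")
```

The same four keys can equally be added by hand to a properties file or a JSON configuration.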

Limitations

  • For resources that do not support fetching records by datetime, new records are fetched at an interval specified by the request.interval.ms configuration. Records for these resources might be duplicated every time the connector restarts.
  • The connector is not able to detect the deletion of data on GitHub.
  • In the case of connector restarts, the Kafka topic might end up having records that are out of order.
  • GitHub has a defined API request limit of 5,000 requests per hour (see https://developer.github.com/apps/building-github-apps/understanding-rate-limits-for-github-apps/). Once this rate limit is exceeded, the connector waits until the API request limit resets.
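
To get a feel for how a polling interval such as request.interval.ms relates to the 5,000 requests per hour limit, a rough estimate may help. The sketch below assumes one API request per resource per repository per poll and ignores pagination, so real usage can be higher. Neither the function name nor this request model comes from the connector documentation.

```python
def estimated_requests_per_hour(interval_ms, repositories, resources):
    """Rough estimate of GitHub API calls made in one hour.

    Assumption (not from the connector docs): each poll issues exactly one
    request per (repository, resource) pair, and pagination is ignored.
    """
    polls_per_hour = 3_600_000 / interval_ms
    return polls_per_hour * repositories * resources


HOURLY_LIMIT = 5000  # limit quoted in the list above

# One repository, three resources, polled every 10 seconds.
usage = estimated_requests_per_hour(10_000, repositories=1, resources=3)
within_limit = usage <= HOURLY_LIMIT
```

Under these assumptions, the example uses 1,080 requests per hour, while polling two repositories every second for three resources would need 21,600 and exceed the limit.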

GitHub Resources

The GitHub connector supports fetching records from the following resources. Refer to the schema of each resource for the structure of its records.

  • assignees: Available assignees for the specified repositories.
  • collaborators: Collaborators for the specified repositories.
  • issues: Issues in all GitHub states.
  • comments: Issue comments.
  • commits: Master branch commits only.
  • pull_requests: Pull requests in all GitHub states.
  • releases: Releases for the specified repositories.
  • reviews: Reviews on pull requests. Reviews can only be fetched with pull requests.
  • review_comments: Review comments on pull requests.
  • stargazers: Stargazers for the specified repositories.
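
The resource names above are the values used in the github.tables property. The Python sketch below validates a selection before it is written to a configuration. It assumes that github.tables accepts a comma-separated list and reads "Reviews can only be fetched with Pull Requests" as requiring pull_requests in the same list. Both points are assumptions, and build_tables_value is a hypothetical helper.

```python
SUPPORTED_RESOURCES = {
    "assignees", "collaborators", "issues", "comments", "commits",
    "pull_requests", "releases", "reviews", "review_comments", "stargazers",
}


def build_tables_value(resources):
    """Validate resource names and join them for the github.tables property."""
    unknown = [r for r in resources if r not in SUPPORTED_RESOURCES]
    if unknown:
        raise ValueError(f"Unsupported resources: {', '.join(unknown)}")
    # Assumed reading of the list above: reviews need pull_requests as well.
    if "reviews" in resources and "pull_requests" not in resources:
        raise ValueError("'reviews' requires 'pull_requests' to be listed too")
    # Assumption: the property takes a comma-separated list.
    return ",".join(resources)


tables = build_tables_value(["pull_requests", "reviews", "stargazers"])
```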

Prerequisites

The following are required to run the Kafka Connect GitHub Source Connector:

  • Kafka Broker: Confluent Platform 3.3.0 or above.
  • Connect: Confluent Platform 4.1.0 or above.
  • Java 1.8
  • No additional setup is required on the GitHub account for this connector to work, other than an access token with repository and user privileges. See Creating a personal access token for the command line.

Install the GitHub Source Connector

You can install this connector by using the Confluent Hub client instructions, or you can manually download the ZIP file.

Install the connector using Confluent Hub

Prerequisite
Confluent Hub Client must be installed. This is installed by default with Confluent Enterprise.

Navigate to your Confluent Platform installation directory and run the following command to install the latest connector version. The connector must be installed on every machine where Connect will run.

confluent-hub install confluentinc/kafka-connect-github:latest

You can install a specific version by replacing latest with a version number. For example:

confluent-hub install confluentinc/kafka-connect-github:1.0.0-preview

Install the connector manually

Download and extract the ZIP file for your connector and then follow the manual connector installation instructions.

License

You can use this connector for a 30-day trial period without a license key.

After 30 days, this connector is available under a Confluent enterprise license. Confluent issues Confluent enterprise license keys to subscribers, along with providing enterprise-level support for Confluent Platform and your connectors. If you are a subscriber, please contact Confluent Support at support@confluent.io for more information.

See Confluent Platform license for license properties and Confluent License Properties for information about the license topic.

Configuration Properties

For a complete list of configuration properties for this connector, see GitHub Source Connector Configuration Properties.

Note

For an example of how to get Kafka Connect connected to Confluent Cloud, see Distributed Cluster.

Quick Start

In this quick start, you configure the GitHub Source connector to fetch the GitHub users who have starred the Apache Kafka repository since 2019-01-01 and write them to a Kafka topic called github-stargazers.

Start Confluent

Start the Confluent services using the following Confluent CLI command:

confluent local services start

Important

Do not use the Confluent CLI in production environments.

Properties-based example

Create a file called github-source-quickstart.properties with the following properties:

name=MyGithubConnector
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1
tasks.max=1
connector.class=io.confluent.connect.github.GithubSourceConnector
github.service.url=https://api.github.com
github.access.token=<ACCESS-TOKEN>
github.repositories=apache/kafka
github.tables=stargazers
github.since=2019-01-01
topic.name.pattern=github-${entityName}
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
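
With github.tables=stargazers and topic.name.pattern=github-${entityName}, records land in the github-stargazers topic used throughout this quick start. The short Python sketch below mimics that substitution for illustration only. The connector's own implementation may differ, and resolve_topic is a hypothetical name.

```python
from string import Template


def resolve_topic(pattern, entity_name):
    """Substitute ${entityName} in a topic.name.pattern value."""
    return Template(pattern).substitute(entityName=entity_name)


topic = resolve_topic("github-${entityName}", "stargazers")
print(topic)  # github-stargazers
```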

Next, load the Source Connector.

Caution

You must include a double dash (--) between the topic name and your flag. For more information, see this post.

confluent local services connect connector load MyGithubConnector --config github-source-quickstart.properties

Your output should resemble the following:

{
    "name": "MyGithubConnector",
    "config": {
        "connector.class": "io.confluent.connect.github.GithubSourceConnector",
        "tasks.max": "1",
        "confluent.topic.bootstrap.servers":"localhost:9092",
        "confluent.topic.replication.factor":"1",
        "github.service.url":"https://api.github.com",
        "github.repositories":"apache/kafka",
        "github.tables":"stargazers",
        "github.since":"2019-01-01",
        "github.access.token":"<Your-Github-Access-Token>",
        "topic.name.pattern":"github-${entityName}",
        "key.converter":"io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url":"http://localhost:8081",
        "value.converter":"io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url":"http://localhost:8081"
    },
    "tasks": [],
    "type": null
}

Enter the following command to confirm that the connector is in a RUNNING state:

confluent local services connect connector status MyGithubConnector

The output should resemble:

{
   "name":"MyGithubConnector",
   "connector":
   {
      "state":"RUNNING",
      "worker_id":"127.0.1.1:8083"
   },
   "tasks":
   [
      {
         "id":0,
         "state":"RUNNING",
         "worker_id":"127.0.1.1:8083"
      }
   ],
   "type":"source"
}

REST-based example

Use this setting with distributed workers. Write the following JSON to config.json, configure all of the required values, and use the curl command shown later in this section to post the configuration to one of the distributed Connect workers. See the Kafka Connect REST API documentation for more information.

{
   "name" : "MyGithubConnector",
   "config" :
   {
      "connector.class" : "io.confluent.connect.github.GithubSourceConnector",
      "confluent.topic.bootstrap.servers": "localhost:9092",
      "confluent.topic.replication.factor": "1",
      "tasks.max" : "1",
      "github.service.url":"https://api.github.com",
      "github.access.token":"< Github-Access-Token >",
      "github.repositories":"apache/kafka",
      "github.tables":"stargazers",
      "github.since":"2019-01-01",
      "topic.name.pattern":"github-${entityName}",
      "key.converter":"io.confluent.connect.avro.AvroConverter",
      "key.converter.schema.registry.url":"http://localhost:8081",
      "value.converter":"io.confluent.connect.avro.AvroConverter",
      "value.converter.schema.registry.url":"http://localhost:8081"
   }
}
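
The JSON above carries the same settings as the properties-based example, with the connector name moved to the top level. As a sketch of that relationship, the Python below converts simple key=value text into the same shape. The parser is deliberately minimal (no escapes or line continuations), and properties_to_payload is a hypothetical helper.

```python
import json


def properties_to_payload(text):
    """Convert simple key=value properties text to a Connect REST payload.

    Simplified parser: it skips blank lines and '#' comments and splits on
    the first '='. It does not handle escapes or line continuations.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    # The JSON form keeps the connector name outside the "config" object.
    name = config.pop("name")
    return {"name": name, "config": config}


sample = """
name=MyGithubConnector
tasks.max=1
github.tables=stargazers
topic.name.pattern=github-${entityName}
"""
payload = properties_to_payload(sample)
print(json.dumps(payload, indent=3))
```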

Note

For staging or production use:

  • Change the confluent.topic.bootstrap.servers property to include your broker address(es).
  • Change the confluent.topic.replication.factor to 3 for staging or production use.
  • Change http://localhost:8083/ to the endpoint of one of your Connect worker(s).

Use curl to post a configuration to one of the Connect workers.

curl -sS -X POST -H 'Content-Type: application/json' --data @config.json http://localhost:8083/connectors
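
For scripted deployments, the same POST can be issued from Python's standard library. The sketch below only builds the request, so it can be inspected without a running worker. The function name is hypothetical, and the default URL is the local worker address used on this page.

```python
import json
import urllib.request


def build_create_request(payload, connect_url="http://localhost:8083"):
    """Build (but do not send) the POST request that creates a connector."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{connect_url.rstrip('/')}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request({"name": "MyGithubConnector", "config": {}})
# To send it against a running worker:
#     with urllib.request.urlopen(request) as response:
#         print(response.status, response.read().decode("utf-8"))
```

In practice the payload would be the full contents of config.json rather than the empty config used here.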

Confirm that the connector is in a RUNNING state by running the following command:

curl http://localhost:8083/connectors/MyGithubConnector/status

The output should resemble the example below:

{
   "name":"MyGithubConnector",
   "connector":{
      "state":"RUNNING",
      "worker_id":"127.0.1.1:8083"
   },
   "tasks":[
      {
         "id":0,
         "state":"RUNNING",
         "worker_id":"127.0.1.1:8083"
      }
   ],
   "type":"source"
}
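
When this check is automated, it is usually not enough to look at the connector state alone, because a task can fail while the connector itself still reports RUNNING. The Python sketch below evaluates a status document shaped like the output above; is_fully_running is a hypothetical helper name.

```python
import json


def is_fully_running(status):
    """Return True when the connector and all of its tasks report RUNNING."""
    if status["connector"]["state"] != "RUNNING":
        return False
    tasks = status.get("tasks", [])
    # An empty task list is treated as "not yet running".
    return bool(tasks) and all(t["state"] == "RUNNING" for t in tasks)


sample = json.loads("""
{
   "name": "MyGithubConnector",
   "connector": {"state": "RUNNING", "worker_id": "127.0.1.1:8083"},
   "tasks": [{"id": 0, "state": "RUNNING", "worker_id": "127.0.1.1:8083"}],
   "type": "source"
}
""")
print(is_fully_running(sample))  # True
```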

Enter the following command to consume records written by the connector to the Kafka topic:

./kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic github-stargazers --from-beginning

The output should resemble the example below:

{
    "type": {
      "string": "STARGAZERS"
    },
    "createdAt": null,
    "data": {
      "data": {
        "login": {
          "string": "User.Name"
        },
        "id": {
          "int": 1234
        },
        "node_id": {
          "string": "MDQ6VXNlcjM0OTE3MTE="
        },
        "avatar_url": {
          "string": "https://avatars2.githubusercontent.com/u/1234?v=4"
        },
        "gravatar_id": {
          "string": ""
        },
        "url": {
          "string": "https://api.github.com/users/User.Name"
        },
        "html_url": {
          "string": "https://github.com/User.Name"
        },
        "followers_url": {
          "string": "https://api.github.com/users/User.Name/followers"
        },
        "following_url": {
          "string": "https://api.github.com/users/User.Name/following{/other_user}"
        },
        "gists_url": {
          "string": "https://api.github.com/users/User.Name/gists{/gist_id}"
        },
        "starred_url": {
          "string": "https://api.github.com/users/User.Name/starred{/owner}{/repo}"
        },
        "subscriptions_url": {
          "string": "https://api.github.com/users/User.Name/subscriptions"
        },
        "organizations_url": {
          "string": "https://api.github.com/users/User.Name/orgs"
        },
        "repos_url": {
          "string": "https://api.github.com/users/User.Name/repos"
        },
        "events_url": {
          "string": "https://api.github.com/users/User.Name/events{/privacy}"
        },
        "received_events_url": {
          "string": "https://api.github.com/users/User.Name/received_events"
        },
        "type": {
          "string": "User"
        },
        "site_admin": {
          "boolean": false
        }
      }
    },
    "id": {
      "string": "1234"
    }
  }
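
Because delivery is at least once and records can be duplicated when the connector restarts, a downstream consumer may want to drop repeats. The Python sketch below does this for records already decoded into dictionaries shaped like the output above. It assumes that the pair of type and id identifies a record, which this page does not state, and it keeps all seen keys in memory, so it is only a starting point.

```python
def record_key(record):
    """Build a (type, id) key from a record shaped like the output above.

    Assumption: (type, id) is unique per record; the docs do not confirm this.
    """
    return (record["type"]["string"], record["id"]["string"])


def deduplicate(records):
    """Yield each record once, keeping the first occurrence of every key."""
    seen = set()
    for record in records:
        key = record_key(record)
        if key in seen:
            continue  # duplicate, for example re-delivered after a restart
        seen.add(key)
        yield record


# Hypothetical stream containing one repeated record.
stream = [
    {"type": {"string": "STARGAZERS"}, "id": {"string": "1234"}},
    {"type": {"string": "STARGAZERS"}, "id": {"string": "5678"}},
    {"type": {"string": "STARGAZERS"}, "id": {"string": "1234"}},
]
unique = list(deduplicate(stream))
```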

Clean up resources

Delete the connector

confluent local services connect connector unload MyGithubConnector

Stop Confluent Platform

confluent local services stop

Additional Documentation