Infrastructure Integration

Configuration

  1. Configure the agent by editing /etc/netsil-dd-agent/conf.d/spark.yaml on the collectors. Example:

    init_config:
    
    instances:
      #
      # The Spark check can retrieve metrics from Standalone Spark, YARN and
      # Mesos. All methods require the `spark_url` to be configured.
      #
      # For Spark Standalone, `spark_url` must be set to the Spark master's web
      # UI. This is "http://localhost:8080" by default.
      #
      # For YARN, `spark_url` must be set to YARN's ResourceManager address. The
      # ResourceManager host name can be found in the yarn-site.xml conf file
      # under the property `yarn.resourcemanager.address`, and the ResourceManager
      # web port under the property `yarn.resourcemanager.webapp.address`. This is
      # "http://localhost:8088" by default.
      #
      # For Mesos, `spark_url` must be set to the Mesos master's web UI. This is
      # "http://<master_ip>:5050" by default, where `<master_ip>` is the IP
      # address or resolvable host name for the Mesos master.
      #
      # The use of `resourcemanager_uri` has been deprecated, but is still functional.
      - spark_url: http://localhost:8088
    
        # The Spark cluster mode must always be set. Uncomment the cluster mode
        # that applies to your deployment.
        # spark_cluster_mode: spark_yarn_mode
        # spark_cluster_mode: spark_standalone_mode
        # spark_cluster_mode: spark_mesos_mode
    
        # To monitor a Standalone Spark cluster older than version 2.0,
        # `spark_pre_20_mode` must be set to true.
        # spark_pre_20_mode: true
        #
        # If you have enabled the Spark UI proxy, set `spark_proxy_enabled` to `true`.
        # spark_proxy_enabled: false
    
        # A required friendly name for the cluster.
        # cluster_name: MySparkCluster
    
        # Optional tags to be applied to every emitted metric.
        # tags:
        #   - key:value
        #   - instance:production
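Putting these options together, a minimal complete instance for a YARN deployment might look like the following (the cluster name and tag are illustrative placeholders — substitute your own):

```yaml
init_config:

instances:
  # Assumption: the YARN ResourceManager runs on this host; point
  # `spark_url` at your ResourceManager's webapp address.
  - spark_url: http://localhost:8088
    spark_cluster_mode: spark_yarn_mode
    cluster_name: MySparkCluster      # illustrative name
    tags:
      - instance:production           # illustrative tag
```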
    
  2. Verify that all YAML files are valid with the following command:

    /etc/init.d/netsil-collectors configcheck
    
  3. Restart the Agent using the following command:

    /etc/init.d/netsil-collectors restart
    
  4. Execute the info command to verify that the integration check has passed:

    /etc/init.d/netsil-collectors info
    

The output of the command should contain a section similar to the following:

    Checks
    ======

      [...]

      spark
      -----
          - instance #0 [OK]
          - Collected 8 metrics & 0 events
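If the check reports an error instead, the most common cause is a `spark_url` that points at the wrong service for the chosen cluster mode. As an illustration (this helper is not part of the agent), the first discovery URL implied by each mode can be sketched as follows; the paths follow the standalone master's JSON page, the YARN ResourceManager REST API, and the Mesos master state endpoint:

```python
def discovery_url(spark_url, cluster_mode):
    """Return the first URL polled to discover running Spark applications.

    Illustrative sketch, not the agent's implementation; paths follow the
    public web APIs of each cluster manager.
    """
    base = spark_url.rstrip("/")
    if cluster_mode == "spark_standalone_mode":
        return base + "/json"                # standalone master web UI, JSON view
    if cluster_mode == "spark_yarn_mode":
        return base + "/ws/v1/cluster/apps"  # YARN ResourceManager: running apps
    if cluster_mode == "spark_mesos_mode":
        return base + "/master/state"        # Mesos master: frameworks and drivers
    raise ValueError("unknown spark_cluster_mode: " + cluster_mode)


print(discovery_url("http://localhost:8088", "spark_yarn_mode"))
# http://localhost:8088/ws/v1/cluster/apps
```

Requesting that URL with `curl` from the collector host is a quick way to confirm reachability before restarting the agent.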

Infrastructure Datasources

| Datasource | Available Aggregations | Unit | Description |
| --- | --- | --- | --- |
| spark.job.num_tasks | avg, max, min, sum | task/second | Number of tasks in the application |
| spark.job.num_active_tasks | avg, max, min, sum | task/second | Number of active tasks in the application |
| spark.job.num_skipped_tasks | avg, max, min, sum | task/second | Number of skipped tasks in the application |
| spark.job.num_failed_tasks | avg, max, min, sum | task/second | Number of failed tasks in the application |
| spark.job.num_active_stages | avg, max, min, sum | stage/second | Number of active stages in the application |
| spark.job.num_completed_stages | avg, max, min, sum | stage/second | Number of completed stages in the application |
| spark.job.num_skipped_stages | avg, max, min, sum | stage/second | Number of skipped stages in the application |
| spark.job.num_failed_stages | avg, max, min, sum | stage/second | Number of failed stages in the application |
| spark.stage.num_active_tasks | avg, max, min, sum | task/second | Number of active tasks in the application's stages |
| spark.stage.num_complete_tasks | avg, max, min, sum | task/second | Number of complete tasks in the application's stages |
| spark.stage.num_failed_tasks | avg, max, min, sum | task/second | Number of failed tasks in the application's stages |
| spark.stage.executor_run_time | avg, max, min, sum | fraction | Fraction of time (ms/s) spent by the executor in the application's stages |
| spark.stage.input_bytes | avg, max, min, sum | byte/second | Input bytes in the application's stages |
| spark.stage.input_records | avg, max, min, sum | record/second | Input records in the application's stages |
| spark.stage.output_bytes | avg, max, min, sum | byte/second | Output bytes in the application's stages |
| spark.stage.output_records | avg, max, min, sum | record/second | Output records in the application's stages |
| spark.stage.shuffle_read_bytes | avg, max, min, sum | byte/second | Number of bytes read during a shuffle in the application's stages |
| spark.stage.shuffle_read_records | avg, max, min, sum | record/second | Number of records read during a shuffle in the application's stages |
| spark.stage.shuffle_write_bytes | avg, max, min, sum | byte/second | Number of shuffled bytes in the application's stages |
| spark.stage.shuffle_write_records | avg, max, min, sum | record/second | Number of shuffled records in the application's stages |
| spark.stage.memory_bytes_spilled | avg, max, min, sum | byte/second | Number of bytes spilled to disk in the application's stages |
| spark.stage.disk_bytes_spilled | avg, max, min, sum | byte/second | Max size on disk of the spilled bytes in the application's stages |
| spark.driver.rdd_blocks | avg, max, min, sum | block/second | Number of RDD blocks in the driver |
| spark.driver.memory_used | avg, max, min, sum | byte/second | Amount of memory used in the driver |
| spark.driver.disk_used | avg, max, min, sum | byte/second | Amount of disk used in the driver |
| spark.driver.active_tasks | avg, max, min, sum | task/second | Number of active tasks in the driver |
| spark.driver.failed_tasks | avg, max, min, sum | task/second | Number of failed tasks in the driver |
| spark.driver.completed_tasks | avg, max, min, sum | task/second | Number of completed tasks in the driver |
| spark.driver.total_tasks | avg, max, min, sum | task/second | Number of total tasks in the driver |
| spark.driver.total_duration | avg, max, min, sum | fraction | Fraction of time (ms/s) spent by the driver |
| spark.driver.total_input_bytes | avg, max, min, sum | byte/second | Number of input bytes in the driver |
| spark.driver.total_shuffle_read | avg, max, min, sum | byte/second | Number of bytes read during a shuffle in the driver |
| spark.driver.total_shuffle_write | avg, max, min, sum | byte/second | Number of shuffled bytes in the driver |
| spark.driver.max_memory | avg, max, min, sum | byte/second | Maximum memory used in the driver |
| spark.executor.rdd_blocks | avg, max, min, sum | block/second | Number of persisted RDD blocks in the application's executors |
| spark.executor.memory_used | avg, max, min, sum | byte/second | Amount of memory used for cached RDDs in the application's executors |
| spark.executor.disk_used | avg, max, min, sum | byte/second | Amount of disk space used by persisted RDDs in the application's executors |
| spark.executor.active_tasks | avg, max, min, sum | task/second | Number of active tasks in the application's executors |
| spark.executor.failed_tasks | avg, max, min, sum | task/second | Number of failed tasks in the application's executors |
| spark.executor.completed_tasks | avg, max, min, sum | task/second | Number of completed tasks in the application's executors |
| spark.executor.total_tasks | avg, max, min, sum | task/second | Total number of tasks in the application's executors |
| spark.executor.total_duration | avg, max, min, sum | fraction | Fraction of time (ms/s) spent by the application's executors executing tasks |
| spark.executor.total_input_bytes | avg, max, min, sum | byte/second | Total number of input bytes in the application's executors |
| spark.executor.total_shuffle_read | avg, max, min, sum | byte/second | Total number of bytes read during a shuffle in the application's executors |
| spark.executor.total_shuffle_write | avg, max, min, sum | byte/second | Total number of shuffled bytes in the application's executors |
| spark.executor_memory | avg, max, min, sum | byte/second | Maximum memory available for caching RDD blocks in the application's executors |
| spark.rdd.num_partitions | avg, max, min, sum | /second | Number of persisted RDD partitions in the application |
| spark.rdd.num_cached_partitions | avg, max, min, sum | /second | Number of in-memory cached RDD partitions in the application |
| spark.rdd.memory_used | avg, max, min, sum | byte/second | Amount of memory used in the application's persisted RDDs |
| spark.rdd.disk_used | avg, max, min, sum | byte/second | Amount of disk space used by persisted RDDs in the application |