Monitor simulation

Flower allows you to monitor system resources while running your simulation. Moreover, the Flower simulation engine lets you decide how to allocate resources on a per-client basis and constrain the total usage. Insights from resource consumption can help you make smarter decisions and speed up execution.

The specific instructions assume you are using macOS and have the Homebrew package manager installed.

Downloads

brew install prometheus grafana

Prometheus is used for data collection, while Grafana lets you visualize the collected data. Both are well integrated with Ray, which Flower uses under the hood.

Overwrite the configuration files (depending on your device, they might be installed in a different path).

If you are on an M1 Mac, it should be:

/opt/homebrew/etc/prometheus.yml
/opt/homebrew/etc/grafana/grafana.ini

On the previous generation Intel Mac devices, it should be:

/usr/local/etc/prometheus.yml
/usr/local/etc/grafana/grafana.ini

Open the respective configuration files and edit them. Depending on your device, use one of the following two commands:

# M1 macOS
open /opt/homebrew/etc/prometheus.yml

# Intel macOS
open /usr/local/etc/prometheus.yml

Then delete all the text in the file and paste in the new Prometheus config shown below. You may adjust the time intervals to your requirements:

global:
  scrape_interval: 1s
  evaluation_interval: 1s

scrape_configs:
# Scrape from each ray node as defined in the service_discovery.json provided by ray.
- job_name: 'ray'
  file_sd_configs:
  - files:
    - '/tmp/ray/prom_metrics_service_discovery.json'
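Prometheus can only scrape Ray once the service discovery file named in this config exists, and Ray writes it only after it has started. The following is a minimal sketch for inspecting that file, assuming it follows the standard Prometheus file-based service discovery format (a JSON list of objects with a `targets` list). The helper name `list_scrape_targets` is hypothetical and not part of Flower, Ray, or Prometheus.

```python
import json
from pathlib import Path

# Path taken from the Prometheus config above.
SD_FILE = Path("/tmp/ray/prom_metrics_service_discovery.json")


def list_scrape_targets(sd_file: Path = SD_FILE) -> list[str]:
    """Return every host:port target listed in a file_sd JSON file.

    Returns an empty list if the file does not exist yet
    (for example, because Ray has not been started).
    """
    if not sd_file.exists():
        return []
    groups = json.loads(sd_file.read_text())
    targets: list[str] = []
    for group in groups:
        # Each group is assumed to be an object with a "targets" list.
        targets.extend(group.get("targets", []))
    return targets


if __name__ == "__main__":
    print(list_scrape_targets())
```

An empty result while a simulation is running would suggest that Ray is writing its metrics information somewhere other than the path given in the config.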

Now that you have edited the Prometheus configuration, do the same with the Grafana configuration file. Open it using one of the following commands, as before:

# M1 macOS
open /opt/homebrew/etc/grafana/grafana.ini

# Intel macOS
open /usr/local/etc/grafana/grafana.ini

Your editor should open and allow you to apply the following configuration, as before.

[security]
allow_embedding = true

[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Viewer

[paths]
provisioning = /tmp/ray/session_latest/metrics/grafana/provisioning
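If you want to double-check the edited file, the settings above can be verified with Python's standard `configparser` module. This is only a sketch: the helper name `missing_settings` is hypothetical, and it assumes the file contains nothing but sectioned `key = value` lines like the ones shown above (a file with keys placed before the first section header would not parse this way).

```python
import configparser

# Settings from the Grafana configuration shown above.
EXPECTED = {
    ("security", "allow_embedding"): "true",
    ("auth.anonymous", "enabled"): "true",
    ("auth.anonymous", "org_name"): "Main Org.",
    ("auth.anonymous", "org_role"): "Viewer",
    ("paths", "provisioning"): "/tmp/ray/session_latest/metrics/grafana/provisioning",
}


def missing_settings(ini_text: str) -> list[str]:
    """Return "section.key" names whose value is absent or differs from EXPECTED."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    problems = []
    for (section, key), value in EXPECTED.items():
        # fallback=None covers both a missing section and a missing key.
        if parser.get(section, key, fallback=None) != value:
            problems.append(f"{section}.{key}")
    return problems
```

To use it, read your `grafana.ini` into a string and pass it in; an empty list means every setting from this guide is present.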

Congratulations, you have just installed and configured all the software needed for metrics tracking. Now, let’s start it.

Tracking metrics

Before running your Flower simulation, you have to start the monitoring tools you have just installed and configured.

brew services start prometheus
brew services start grafana

Please include the following argument in your Python code when starting a simulation.

import flwr as fl

fl.simulation.start_simulation(
    # ...
    # all the args you used before
    # ...
    ray_init_args={"include_dashboard": True},  # enables the Ray Dashboard
)

Now, you are ready to start your workload.

Shortly after the simulation starts, you should see the following logs in your terminal:

2023-01-20 16:22:58,620       INFO worker.py:1529 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265

You can look at everything at http://127.0.0.1:8265.

This is the Ray Dashboard. You can navigate to Metrics (the lowest option in the left panel).

Alternatively, you can view the metrics in Grafana by clicking “View in Grafana” in the top-right corner. Please note that the Ray Dashboard is only accessible while the simulation is running. After the simulation ends, you can only use Grafana to explore the metrics. You can open Grafana by going to http://localhost:3000/.

After you finish the visualization, stop Prometheus and Grafana. This is important because, as long as they are running, they keep ports on your machine occupied (for example, port 3000).

brew services stop prometheus
brew services stop grafana
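To confirm that the ports have actually been released (or, before a run, that the services are up), you can probe them from Python. This is a minimal sketch using only the standard library. Ports 8265 and 3000 come from this guide; 9090 is the Prometheus default and is an assumption if you changed it. The helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    ports = {"Ray Dashboard": 8265, "Grafana": 3000, "Prometheus": 9090}
    for name, port in ports.items():
        state = "in use" if is_port_open("127.0.0.1", port) else "free"
        print(f"{name} (port {port}): {state}")
```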

Resource allocation

To allocate system resources to simulation clients efficiently on your own, you need to understand how the Ray library works.

Initially, the simulation (which Ray handles under the hood) starts by default with all the available resources on the system, which it shares among the clients. This does not mean that it divides them equally among the clients, nor that model training happens on all of them simultaneously. You will learn more about that later in this blog. You can check the system resources by running the following:

import ray
ray.available_resources()

In Google Colab, the result you see might be similar to this:

{'memory': 8020104807.0,
 'GPU': 1.0,
 'object_store_memory': 4010052403.0,
 'CPU': 2.0,
 'accelerator_type:T4': 1.0,
 'node:172.28.0.2': 1.0}
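The `memory` and `object_store_memory` entries appear to be byte counts, which are hard to read at a glance. The sketch below formats such a dictionary with those two entries converted to GiB. The helper name `summarize_resources` is hypothetical, and treating exactly these two keys as bytes is an assumption based on the output above.

```python
BYTES_PER_GIB = 1024**3

# Keys assumed to be reported in bytes.
BYTE_KEYS = {"memory", "object_store_memory"}


def summarize_resources(resources: dict[str, float]) -> dict[str, str]:
    """Format a Ray resource dictionary, converting byte counts to GiB."""
    summary = {}
    for name, value in resources.items():
        if name in BYTE_KEYS:
            summary[name] = f"{value / BYTES_PER_GIB:.2f} GiB"
        else:
            summary[name] = f"{value:g}"
    return summary


# Values copied from the Google Colab output above.
colab = {"memory": 8020104807.0, "GPU": 1.0, "object_store_memory": 4010052403.0, "CPU": 2.0}
for name, text in summarize_resources(colab).items():
    print(f"{name}: {text}")
```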

However, you can override the defaults. When starting a simulation, do the following (you don’t need to override all of them):

num_cpus = 2
num_gpus = 1
ram_memory = 16_000 * 1024 * 1024  # 16 GB
fl.simulation.start_simulation(
    # ...
    # all the args you were specifying before
    # ...
    ray_init_args={
        "include_dashboard": True,  # we need this one for tracking
        "num_cpus": num_cpus,
        "num_gpus": num_gpus,
        "memory": ram_memory,
    },
)

Let’s also specify the resources for a single client.

# Total resources for simulation
num_cpus = 4
num_gpus = 1
ram_memory = 16_000 * 1024 * 1024 # 16 GB

# Single client resources
client_num_cpus = 2
client_num_gpus = 1

fl.simulation.start_simulation(
    # ...
    # all the args you were specifying before
    # ...
    ray_init_args={
        "include_dashboard": True,  # we need this one for tracking
        "num_cpus": num_cpus,
        "num_gpus": num_gpus,
        "memory": ram_memory,
    },
    # The argument below is new
    client_resources={
        "num_cpus": client_num_cpus,
        "num_gpus": client_num_gpus,
    },
)

Now comes the crucial part. Ray starts a new client only when all the resources that client requires are available, so clients run in parallel only as far as the resources allow.

In the example above, only one client runs at a time, because each client requires the single available GPU, so your clients won’t run concurrently. Setting client_num_gpus = 0.5 would allow two clients to run concurrently. Be careful not to require more resources than are available. If you specified client_num_gpus = 2, the simulation wouldn’t start (even if you had 2 GPUs but decided to set 1 in ray_init_args).
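The arithmetic behind these three cases can be sketched as follows. This is a simplified estimate, not Ray's actual scheduler: it assumes the scarcest resource alone decides how many clients fit, and it ignores memory and any other constraints. The function name `max_concurrent_clients` is hypothetical.

```python
import math


def max_concurrent_clients(total: dict[str, float], per_client: dict[str, float]) -> int:
    """Estimate how many clients fit into the total resources at once.

    Each resource limits the count to floor(total / per_client);
    the scarcest resource decides.
    """
    limits = []
    for name, needed in per_client.items():
        if needed <= 0:
            continue  # a resource the client does not need imposes no limit
        available = total.get(name, 0)
        # Small epsilon guards against float rounding in the division.
        limits.append(math.floor(available / needed + 1e-9))
    if not limits:
        raise ValueError("per_client must request at least one resource")
    return min(limits)


total = {"num_cpus": 4, "num_gpus": 1}
print(max_concurrent_clients(total, {"num_cpus": 2, "num_gpus": 1}))    # 1
print(max_concurrent_clients(total, {"num_cpus": 2, "num_gpus": 0.5}))  # 2
print(max_concurrent_clients(total, {"num_cpus": 2, "num_gpus": 2}))    # 0
```

A result of 0 corresponds to the case where the simulation would not start, because no single client can ever be given what it asks for.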

FAQ

Q: I don’t see any metrics logged.

A: The timeframe might not be properly set. The setting is in the top right corner (“Last 30 minutes” by default). Please change the timeframe to reflect the period when the simulation was running.

Q: I see “Grafana server not detected. Please make sure the Grafana server is running and refresh this page” after going to the Metrics tab in Ray Dashboard.

A: You probably don’t have Grafana running. Please check the running services:

brew services list

Q: I see “This site can’t be reached” when going to http://127.0.0.1:8265.

A: Either the simulation has already finished, or you still need to start Prometheus.

Resources

Ray Dashboard: https://docs.ray.io/en/latest/ray-observability/getting-started.html

Ray Metrics: https://docs.ray.io/en/latest/cluster/metrics.html