Introduction
We wanted to give you a quick intro to monitoring your Service Fabric cluster with the updated Service Fabric Analytics solution in OMS Log Analytics for Windows clusters, which rolled out at the end of last month. The solution has been enhanced to combine key container and platform-level events into one comprehensive dashboard. This is particularly useful if you run multiple applications on your cluster that use reliable services, containers, or both. The single dashboard helps you see what’s happening in your cluster, the underlying infrastructure, your applications, and your containers. Let's dig into how you can set up your cluster to be monitored with Service Fabric Analytics and OMS Log Analytics.
Setup
Existing OMS users:
Users who already have the Service Fabric Analytics solution set up do not need to take any action as the solution will auto update. If you want to monitor containers or collect performance counters from your cluster, you can add the container solution and OMS Agent to your nodes, respectively.
New users:
Before adding the solution, make sure you have the WAD extension configured correctly to ensure that Service Fabric events and other required diagnostics data is collected from your cluster.
The new Service Fabric Analytics solution is available via the Azure Marketplace just like before; check out the instructions to add the solution here. This will show you how to configure your solution to read from the tables the WAD extension sends data to.
Next up, set up the OMS Agent to collect performance counters for your nodes. This is the recommended way to collect perf counters / metrics from your machines because of the ease with which you can pick and modify the counters you want collected. Add the OMS Agent to your nodes via the Azure CLI or by updating your cluster’s ARM template – see instructions here.
If you are running containers, you will also need to add the Container Monitoring Solution to your workspace and the OMS Agent to your nodes. This is because the current design of the OMS Agent requires that the Container Monitoring Solution be added to the Log Analytics workspace before it starts collecting docker logs and stats. This dependency on the containers solution is temporary, and we are working to add the same requirements to the Service Fabric Analytics solution.
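Once the Container Monitoring Solution and the OMS Agent are in place, a quick query in Log Search can confirm that container data is reaching the workspace. The following is only a sketch: it assumes the solution writes to a ContainerInventory table with ContainerID, Computer, Image, and ContainerState columns, so adjust the names if your workspace schema differs.

```kusto
// Assumed table and column names from the Container Monitoring Solution schema.
ContainerInventory
// Look only at recent records so stale data does not mask a broken agent.
| where TimeGenerated > ago(1h)
// Count distinct containers per node, image, and state.
| summarize Containers = dcount(ContainerID) by Computer, Image, ContainerState
```

An empty result after the agent has been running for a while would suggest that the agent or the solution is not yet configured on those nodes.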
Using OMS Log Analytics
Once you have the resource(s) added, you can view the Service Fabric Analytics dashboard from within the Azure portal itself by clicking the tile under Summary on the solution's blade.
This takes you to the consolidated dashboard with cluster and container data. Scroll to the right to view container deployments and performance metrics.
You can also run custom queries and create alerts. Click on one of the tiles or graphs to navigate to "Log Search" in the workspace. For example, to alert on all events related to a cluster rolling back an upgrade, run the following query and click “New Alert Rule”.
ServiceFabricOperationalEvent
| where EventId > 29628 and EventId < 29631
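Before turning this into an alert, it can help to see how often these events have fired. A minimal sketch that extends the same filter, assuming only the standard TimeGenerated column that Log Analytics records carry:

```kusto
ServiceFabricOperationalEvent
// Same EventId range as the alert query (matches 29629 and 29630).
| where EventId > 29628 and EventId < 29631
// Count matching events per event ID in one-hour buckets.
| summarize Events = count() by EventId, bin(TimeGenerated, 1h)
// Show the most recent buckets first.
| order by TimeGenerated desc
```

The one-hour bucket is an arbitrary choice for illustration; a shorter bin may suit clusters that upgrade frequently.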
You can also configure the OMS Agent to collect specific performance counters. Navigate to the OMS Workspace’s page in the portal – from the solution’s page the workspace tab is on the left menu.
Once you’re on the workspace’s page, click on “Advanced settings” in the same left menu.
Then click on Data > Windows Performance Counters to start collecting specific counters from your nodes via the OMS Agent. Here is a list of perf counters that we recommend you collect. If you are using Reliable Services or Actors in your applications, add the Service Fabric Actor, Actor Method, Service, and Service Method counters as well. More information on these counters can be found at Diagnostics and performance monitoring for Reliable Actors and Diagnostics and performance monitoring for Reliable Service Remoting.
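After adding counters, you may want to check that they are actually arriving in the workspace. One way to do that, sketched here on the assumption that the counters land in the standard Perf table, is to list which object and counter names have reported recently:

```kusto
Perf
// Restrict to the last hour so newly added counters are easy to spot.
| where TimeGenerated > ago(1h)
// One row per counter, with the number of samples received.
| summarize Samples = count() by ObjectName, CounterName
| order by ObjectName asc, CounterName asc
```

A counter that was configured but is missing from this list has not reported any samples in the selected window.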
This will allow you to see how your infrastructure is handling your workloads, and set relevant alerts based on resource utilization. For example, you may want to set an alert if the total processor utilization (Processor(_Total)\% Processor Time) goes above 80% or below 5%. The counter name you would use for this is “% Processor Time”. You could do this by creating an alert rule for the following query:
Perf
| where CounterName == "% Processor Time" and InstanceName == "_Total"
| where CounterValue >= 80 or CounterValue <= 5
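A query on raw samples matches every individual reading outside the range, which can be noisy on busy nodes. A variation, offered as a sketch rather than a recommendation, averages the counter per node over a short window before applying the thresholds. The five-minute window and the AvgCpu column name are illustrative choices, not values from the solution.

```kusto
Perf
// Filter on the object as well, so only the Processor counter is matched.
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
// Average the readings per node in five-minute buckets.
| summarize AvgCpu = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
// Keep only buckets where the average is outside the 5% to 80% range.
| where AvgCpu >= 80 or AvgCpu <= 5
```

Averaging trades alert latency for fewer false positives, so the window should match how quickly you need to react.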
Recommendations
We recommend using the WAD extension for Windows clusters to get platform, reliable services, and reliable actors events flowing into this solution. WAD is also extensible to send data to other solutions via ‘sinks’ such as Application Insights or Event Hubs.
For infrastructure and container monitoring, we recommend using the OMS Agent to collect any performance counters (including Service Fabric specific counters) and container metrics.