How to set up monitoring for your integration stack
When maintaining integration software, it is sometimes difficult to understand what is going on. Maybe there’s a scheduled job that triggers synchronization between databases. Or a new file on an SFTP server triggers a file transfer. Or a React app uses your API to retrieve data. For all those use cases you want to understand what’s been going on. Perhaps you, as a developer, know this exactly while you’re building it. But as soon as another project requires your focus and attention, you tend to forget the details of your previous work. And you know that somewhere in the future your project manager will blame your integration software when a file hasn’t been transferred for a couple of days.
When trying to debug the problem, you search for hours through several log files for that one specific situation where it went wrong. Hopefully you’ll find it, but most of the time it’s already too late and log rotation has already overwritten your older files.
Let’s go back a few steps. What can we do to overcome this simple problem? You already clicked on the link for this article, so perhaps you already know the answer. Monitoring. A clear dashboard that describes exactly what’s been going on. Perhaps your integration stack already has the possibility to create such a monitoring dashboard, with nice graphs to see how your JVM is holding up, what your CPU or memory usage is, and how many requests and successful responses there were. Although this is great, it lacks one simple answer: details. How many times has that specific file transfer been triggered? How many times was it successful? And if it was not successful, why did it fail? What was the error message? If you see your memory usage spiking through the roof you could already get an idea of why it happened, but most of the time this should not be the issue. You are a seasoned developer, and you already thought about your resource requirements beforehand.
In my years of experience working as an integration developer, I learned that providing valuable information about your integrations in a clear way is mission critical for businesses. Open up the big black box and show what’s been going on. In this article, I would like to show you how to set up such a monitoring dashboard using the ELK-stack.
Here at HybrIT we love Mulesoft. In this article I will use the log files that Mulesoft produces, but you can use the same technique for other languages as well, as long as you are able to send or read log files.
ELK stands for Elasticsearch, Logstash and Kibana, and they are the key components for our monitoring dashboard. Elasticsearch is the core ‘database’ and stores JSON documents that can be easily indexed in many ways. Logstash is used to transform and enrich (log) data before it’s sent to Elasticsearch. And Kibana is the graphical user interface on top of Elasticsearch.
It is possible to use a fourth component: the Filebeat agent. This is one of a family of Beats agents, each with a single specific goal. The Filebeat agent reads data from a file and is able to send it somewhere else. Filebeat is designed to work with log files. It is able to understand when new log entries are written or when a log file rotates. But if you’ve deployed your integration in the cloud, it’s possible that you cannot read those log files directly, or even install software like the Filebeat agent. Luckily there are other solutions out there. One of them is to use Log4J2 to send the logs to our stack. Mulesoft have written an excellent article on how to set this up.
Let’s get started
At this stage, let me first explain what data is important for your dashboard. Every log entry should have the following variables available:
Correlation ID: This identifier uniquely identifies one single message. In Mulesoft this means one message that can travel over several APIs. With this correlation ID, you are able to filter on the specific log entries for one message only.
Interface Name: Most setups have multiple integrations running. Name those integrations. This will be helpful for your non-technical colleagues to identify to which integration a specific error message belongs.
Together with those variables, we also need to mark specific points in a message. We want to log one specific message when the integration starts, and one when it completes. But this should happen only once per integration, per message. We can use a specific log message for this. For example:
The normal logger component in Mulesoft only provides one field to enter data in, which is not very useful with the above variables. You should use the JSON Logger plugin within Mulesoft to be able to add more variables to your log file. This simply combines all the variables in a JSON document which we can read in Elasticsearch.
As a result, this should be your log file for one single message:
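The original listing is not reproduced here. As a rough illustration only, this Python sketch builds two log documents for a single message, one for the start and one for the completion. The field names correlationId, interfaceType, environment, elapsed and applicationName come from this article; all values, and the exact layout the JSON Logger produces, are assumptions.

```python
import json
import uuid


def log_entry(correlation_id, interface_type, message, elapsed_ms):
    """Build one JSON log document with the fields discussed in the article."""
    return {
        "correlationId": correlation_id,
        "interfaceType": interface_type,
        "message": message,
        # Fields the JSON logger adds on its own; the values here are made up.
        "environment": "dev",
        "elapsed": elapsed_ms,
        "applicationName": "invoice-sync",
    }


# One correlation ID ties together every entry that belongs to one message.
correlation_id = str(uuid.uuid4())
entries = [
    log_entry(correlation_id, "invoices", "Start processing", 0),
    log_entry(correlation_id, "invoices", "Completed processing", 412),
]

for entry in entries:
    print(json.dumps(entry))
```

Because both entries share the same correlationId, a single filter on that value returns the complete history of the message.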
The JSON logger automatically adds extra fields which we will definitely use in the dashboard:
environment: To filter on an environment.
elapsed: To identify long running flows.
applicationName: To filter on an application.
These logs are written to a file which we will read using Filebeat.
Let’s start with Elasticsearch. You’ve installed one or more Elasticsearch instances and bound them together in a cluster. You should have also installed Kibana so that you can easily access this Elasticsearch cluster. This article will not go in depth on how to install these. Instead, we will focus on the setup of indices, templates and policies.
First, we need an index template that defines which fields we need:
This index template is automatically applied when a new index is created whose name matches the index_patterns setting. The mappings part is where we define the necessary fields. We let Elasticsearch automatically add new fields to the mapping since we don’t know what type of fields we’ll log in the near future. But you can define a static list as well. I’ve added the most important fields for this article to get you started.
Notice the settings section? It also states that we want to use an index lifecycle management (ILM) policy. So let’s create that one as well:
An ILM policy is helpful when you want to optimize the storage of your logs. The most recent log entries are usually the most searched ones. Give me the monitoring status for the last 24 hours, for example. But every once in a while you want to navigate back a few days. Should we apply the same number of replicas and shards to older indexed logs? That would require a lot of resources. To help with this, you can define such an ILM policy in Elasticsearch.
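The original template is not reproduced here, so here is a minimal sketch of what such a request body could look like, written as a Python dictionary that is printed as JSON. The template name, index pattern, policy name, priority and the chosen field types are all assumptions for illustration, not the values from the original setup.

```python
import json

# Hypothetical names; pick ones that match your own naming scheme.
TEMPLATE_NAME = "mule-logs"
ILM_POLICY_NAME = "mule-logs-policy"

index_template = {
    "index_patterns": ["logs-mule-*"],
    # Marks matching indices as backing indices of a data stream.
    "data_stream": {},
    # Assumed value; it only needs to win over other matching templates.
    "priority": 200,
    "template": {
        "settings": {"index.lifecycle.name": ILM_POLICY_NAME},
        "mappings": {
            # Let Elasticsearch add unknown fields automatically.
            "dynamic": True,
            "properties": {
                "@timestamp": {"type": "date"},
                "correlationId": {"type": "keyword"},
                "interfaceType": {"type": "keyword"},
                "loglevel": {"type": "keyword"},
                "message": {"type": "text"},
                "elapsed": {"type": "long"},
            },
        },
    },
}

# The printed JSON is the body for the request on the first line.
print(f"PUT _index_template/{TEMPLATE_NAME}")
print(json.dumps(index_template, indent=2))
```

Fields used for filtering and aggregating, such as correlationId and interfaceType, are mapped as keyword here so that a terms aggregation works on them directly.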
You want to store your most searched indexed documents on hardware with the fastest profile. But this is not necessary anymore for less searched indexed documents.
What you define in this policy is something you have to figure out yourself and depends on your company’s needs. But to get you started, here’s one that stores the data in a ‘hot’ tier for 7 days, then transfers it to a ‘warm’ tier and shrinks the number of shards to 1. After 31 days, indices are removed again:
The number of days that you define in the min_age and max_age settings refers to the age of the index itself (counted from rollover once an index has rolled over), not to the age of the indexed document (= log entry).
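The original policy is not reproduced here. A minimal sketch of a policy with the three phases described above could look like this, again as a Python dictionary printed as JSON. The warm and delete ages follow the article; the rollover trigger in the hot phase is an assumption, since the article only mentions that a max_age setting is involved.

```python
import json

ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    # Assumed rollover trigger; adjust to your log volume.
                    "rollover": {"max_age": "1d"},
                },
            },
            "warm": {
                "min_age": "7d",
                "actions": {"shrink": {"number_of_shards": 1}},
            },
            "delete": {
                "min_age": "31d",
                "actions": {"delete": {}},
            },
        }
    }
}


def days(age):
    """Convert an age such as '7d' to a number of days (days only)."""
    if not age.endswith("d"):
        raise ValueError(f"unsupported unit in {age!r}")
    return int(age[:-1])


# Sanity check: an index must reach the warm phase before it is deleted.
phases = ilm_policy["policy"]["phases"]
assert days(phases["warm"]["min_age"]) < days(phases["delete"]["min_age"])

# Body for: PUT _ilm/policy/<your-policy-name>
print(json.dumps(ilm_policy, indent=2))
```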
Now that we’ve prepared the Elasticsearch setup, we’re ready to insert logging documents. Logstash is well suited for inserting such documents into Elasticsearch. Within Logstash you define pipelines that have an input and an output. There are many inputs and outputs available. As input I’ve chosen the File input plugin, which reads data directly from the log files. As output I’ve chosen the Elasticsearch output plugin, which sends the data directly to Elasticsearch. Both plugins are available out-of-the-box in the Logstash installation.
The input section describes the path of the log files. You can use an asterisk as a wildcard to read all log files. Logstash will read the file line by line. The multiline codec will combine lines that belong together. It does that by using the mentioned pattern: if a line does not start with INFO, FATAL, ERROR, WARN, DEBUG or TRACE, then it belongs to the previous log record.
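To make the multiline rule concrete, here is a small Python sketch of the same idea. It is not the Logstash codec itself, only an imitation of its behaviour, and the sample log lines are invented.

```python
import re

# A new log record starts with one of these log levels.
LEVEL_PATTERN = re.compile(r"^(INFO|FATAL|ERROR|WARN|DEBUG|TRACE)")


def combine_multiline(lines):
    """Append every line that does not start a new record to the previous one."""
    events = []
    for line in lines:
        if LEVEL_PATTERN.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


raw = [
    "INFO  2023-01-05 10:00:00,123 Start processing",
    "ERROR 2023-01-05 10:00:01,456 Error while processing",
    "java.lang.IllegalStateException: boom",
    "    at com.example.Flow.run(Flow.java:42)",
    "INFO  2023-01-05 10:00:02,789 Completed processing",
]

events = combine_multiline(raw)
for event in events:
    print(repr(event))
```

The stack trace ends up inside the ERROR record instead of becoming two separate, meaningless documents.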
Next, we’ll define some filters to transform the raw log data before it is inserted into Elasticsearch. The grok filter is used to parse non-JSON fields. With this filter, we can already save the log level, the timestamp, the correlationId and a few other things. The correlationId is also provided by the JSON logger, but we want to read all log entries, including for example start-up logs and exceptions. This way we already have the correlationId for each row.
Then we define the timestamp for this row. The date filter plugin simply parses a string that was already extracted by the grok filter. This field is then used as the timestamp in Elasticsearch.
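The grok pattern itself is not reproduced here. As an illustration of what it does, this Python sketch uses a regular expression with named groups to pull the same kind of fields out of one log line. The sample line and its layout are assumptions; your own Log4J2 pattern decides what the lines really look like.

```python
import re

# Assumed layout: LEVEL timestamp [thread] [processor: ...; event: <id>] logger: message
LINE = re.compile(
    r"^(?P<loglevel>[A-Z]+)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"\[(?P<thread>.*?)\]\s+"
    r"\[processor: (?P<processor>[^;]*); event: (?P<correlationId>[^\]]*)\]\s+"
    r"(?P<logger>\S+): (?P<message>.*)$",
    re.DOTALL,  # the message may span several lines (stack traces)
)


def parse_line(line):
    """Return the extracted fields, or None when the line does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None


sample = (
    "INFO  2023-01-05 10:00:00,123 [[MuleRuntime].uber.01] "
    "[processor: invoice-flow/processors/0; event: 1234-abcd] "
    'org.mule.extension.jsonlogger.JsonLogger: {"message": "Start processing"}'
)

fields = parse_line(sample)
print(fields["loglevel"], fields["timestamp"], fields["correlationId"])
```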
This timestamp is also used by the next plugin, the age filter. It drops documents that are older than 31 days. In our ILM policy we decided to delete indices that are older than 31 days. But the first time you run Logstash, it will read the log files starting from the beginning, including older logs. This plugin is not provided by default and has to be installed. Simply run ‘bin/logstash-plugin install logstash-filter-age’ from within your Logstash installation directory.
The last plugin, the JSON filter, is then used to parse the JSON in the message. To make sure we also keep non-JSON messages (like start-up logs, exceptions, etc.), we set skip_on_invalid_json to true.
The Elasticsearch output plugin then describes where the log document needs to be sent. A data stream is an append-only series of indices that can be used as a single named resource for inserting data. Data streams are well-suited for logs, events, metrics, and other continuously generated data. To make sure Logstash understands that we want to use this functionality, we set the data_stream option to true and set the data stream type as well.
Now that we’ve defined the pipeline, save it, add it to your pipelines.yml file in the Logstash configuration directory, and restart Logstash. The insertion of logs into Elasticsearch will start.
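The combination of the date and age filters can be sketched in a few lines of Python. This is only an imitation of the logic, not the plugins themselves. The timestamp format and the assumption that timestamps are in UTC are mine; the 31-day limit comes from the ILM policy above.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=31)


def parse_timestamp(text):
    """Parse a '2023-01-05 10:00:00,123' style timestamp, assumed to be UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")
    return parsed.replace(tzinfo=timezone.utc)


def keep_event(event, now):
    """Keep only events that are at most MAX_AGE old."""
    return now - parse_timestamp(event["timestamp"]) <= MAX_AGE


# A fixed 'now' keeps the example reproducible.
now = datetime(2023, 2, 10, tzinfo=timezone.utc)
old_event = {"timestamp": "2023-01-05 10:00:00,123"}
recent_event = {"timestamp": "2023-02-01 08:30:00,000"}

print(keep_event(old_event, now))     # older than 31 days, so it is dropped
print(keep_event(recent_event, now))  # recent enough, so it is kept
```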
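The effect of skipping invalid JSON can be illustrated with a short Python sketch. It mimics the behaviour described above and is not the Logstash plugin itself; the event layout is an assumption.

```python
import json


def expand_json_message(event):
    """Merge the JSON fields of the message into the event when possible.

    Events whose message is not a JSON object are returned unchanged, so
    start-up logs and exceptions are kept as plain text.
    """
    try:
        parsed = json.loads(event["message"])
    except (ValueError, TypeError):
        return event
    if isinstance(parsed, dict):
        return {**event, **parsed}
    return event


json_event = {
    "loglevel": "INFO",
    "message": '{"message": "Start processing", "interfaceType": "invoices"}',
}
plain_event = {"loglevel": "INFO", "message": "Mule is up and kicking"}

print(expand_json_message(json_event))
print(expand_json_message(plain_event))
```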
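To show what append-only means in practice, here is a sketch of the bulk request body that a client would send to a data stream. Logstash does this for you; the sketch only illustrates that every document is added with a create action and needs an @timestamp field. The data stream name is an assumption that follows the type-dataset-namespace naming scheme.

```python
import json

# Assumed name, following the <type>-<dataset>-<namespace> scheme.
DATA_STREAM = "logs-mule-default"


def bulk_body(events):
    """Build a newline-delimited bulk body that only uses 'create' actions."""
    lines = []
    for event in events:
        lines.append(json.dumps({"create": {}}))
        lines.append(json.dumps(event))
    # The bulk API expects a trailing newline.
    return "\n".join(lines) + "\n"


events = [
    {"@timestamp": "2023-01-05T10:00:00.123Z", "message": "Start processing"},
    {"@timestamp": "2023-01-05T10:00:02.789Z", "message": "Completed processing"},
]

body = bulk_body(events)
print(f"POST /{DATA_STREAM}/_bulk")
print(body, end="")
```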
The last part is to visualize the log data. In Kibana you can create custom visualizations and combine those in a dashboard. The first visualization will be a table listing all the different interfaces and how many times they were executed. Create a visualization in Kibana and select Aggregation based -> Data table. Select your data view.
We want to aggregate the data on interface type, but first we have to provide this interfaceType in our Mulesoft project. In each JSON logger component, add the interfaceType key in the Content section. A convenient way is to set a ‘traceMetaData’ variable at the start of the flow and use that variable in each of the following JSON logger components. Then add one JSON logger at the start of the flow with the message “Start processing”, one at the end of the flow with the message “Completed processing” and another one in your global or local error handler with the message “Error while processing: #[error.description]”.
Then in the Kibana visualization, you can start by adding a Bucket (split rows). Select the “Terms” aggregation and set the field to your “interfaceType” field. Then we’ll define the “Started”, “Completed” and “Error” columns by adding these as Metrics. As aggregation, choose the “Sum Bucket” option. Within this Bucket, choose “Filters” as Aggregation and ‘message: “Start processing”’ as filter. Give it the name “Started”. This will count the number of log entries that have “Start processing” in the message. Repeat this for the next column with ‘message: “Completed processing”’ and another column with ‘message: “Error while processing”’. The result is a table which tells you how many times an interface was started, completed, or ended in an error. The sum of the completed and error columns should be the same as the started column.
The next visualization is a timeline which shows the different types of logs. Create another visualization and select Aggregation based -> Timelion. Using Elasticsearch queries you can define the different lines:
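What this table computes can be sketched in plain Python: group the log entries per interfaceType and count how many of them match each of the three message filters. The sample documents are invented; the column names and phrases are the ones used above.

```python
from collections import defaultdict

# Column name -> phrase that must appear in the message.
FILTERS = {
    "Started": "Start processing",
    "Completed": "Completed processing",
    "Error": "Error while processing",
}


def interface_table(logs):
    """Count started, completed and failed runs per interface type."""
    table = defaultdict(lambda: {name: 0 for name in FILTERS})
    for log in logs:
        interface = log.get("interfaceType")
        if interface is None:
            continue  # start-up logs and similar have no interface type
        for column, phrase in FILTERS.items():
            if phrase in log.get("message", ""):
                table[interface][column] += 1
    return dict(table)


logs = [
    {"interfaceType": "invoices", "message": "Start processing"},
    {"interfaceType": "invoices", "message": "Completed processing"},
    {"interfaceType": "invoices", "message": "Start processing"},
    {"interfaceType": "invoices", "message": "Error while processing: timeout"},
    {"interfaceType": "orders", "message": "Start processing"},
    {"interfaceType": "orders", "message": "Completed processing"},
    {"message": "Mule is up and kicking"},
]

table = interface_table(logs)
for interface, row in table.items():
    print(interface, row)
```

For every interface in this sample, Completed plus Error adds up to Started, which is the consistency check mentioned above.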
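The original Timelion expressions are not reproduced here. To illustrate what each line in such a timeline represents, this Python sketch counts log entries per hour and per log level. The field names @timestamp and loglevel and the hourly interval are assumptions.

```python
from collections import Counter
from datetime import datetime


def timeline(logs):
    """Count log entries per (hour, log level) bucket."""
    counts = Counter()
    for log in logs:
        moment = datetime.fromisoformat(log["@timestamp"])
        hour = moment.replace(minute=0, second=0, microsecond=0)
        counts[(hour.isoformat(), log["loglevel"])] += 1
    return counts


logs = [
    {"@timestamp": "2023-01-05T10:05:00", "loglevel": "INFO"},
    {"@timestamp": "2023-01-05T10:15:00", "loglevel": "ERROR"},
    {"@timestamp": "2023-01-05T10:45:00", "loglevel": "ERROR"},
    {"@timestamp": "2023-01-05T11:01:00", "loglevel": "INFO"},
]

counts = timeline(logs)
for (hour, level), count in sorted(counts.items()):
    print(hour, level, count)
```

Each log level becomes one line in the chart, and each hour bucket one point on that line.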
Then, we’ll add one last visualization, but this will not be a standard visualization. We need to provide the log data itself so you know what’s going on. You can use a saved search for this. Go to the Discover page and select a few columns that you want to show, like ‘message’, ‘interfaceType’ and ‘correlationId’. The rest is up to you. Save this as “Log search”.
Now it’s time to create the dashboard. Select the Dashboard option in the menu and click the “Create dashboard” button. Now select the created visualizations and the saved search using the “Add from library” button. The result should look something like this:
To investigate why a specific interface failed, you can filter down on that specific interface by hovering over the interface type in the data table you created and clicking the magnifier button that appears. In your Log search, you will then only see the logs for that specific interface.
You could even create another saved search which includes a filter on “Error” messages to instantly show all errors. Like this:
This will show you the error immediately. It also provides an easy way to filter down on the specific correlationId to understand what happened before the error was thrown.
Depending on your use cases you could create other visualizations based on these techniques as well, such as:
A pie chart listing how many times each response code was returned (provide this in the Content section of the JSON loggers)
A line graph that shows the maximum, minimum or median response time (the JSON logger adds the ‘elapsed’ field)
An invoice progress table, created by filtering the specific states as different columns like we did for interface types.