
Even though this course aims to teach the practical concepts behind monitoring, we still need tools to monitor things with! We'll be using a combination of Prometheus, Alertmanager, and Grafana: Prometheus is a pull-based monitoring and alerting solution, Alertmanager collects alerts from Prometheus and pushes out notifications, and Grafana queries the metrics Prometheus collects to create visualizations.
If we're going to have a monitoring course, we need something to monitor! Part of that is
going to be our Ubuntu 18.04 host, but another equally important part is going to be a
web application that already exists on the provided Playground server for this course.
The application is a simple to-do list program called Forethought that uses the Express
web framework to do most of the hard work for us. The application has also been
Dockerized and saved as an image (also called forethought) and is ready for us to
deploy.
Want to use your own server and not the provided Playground? See the steps in the study
guide!

Steps in This Video


1. List the contents of the forethought directory and its subdirectories:

$ ls -R forethought/

2. Confirm that the existing Docker image is present:

$ docker image list

3. Deploy the web application to a container, mapping port 8080 on the container to port 80 on the host:

$ docker run --name ft-app -p 80:8080 -d forethought

4. Check that the application is working correctly by visiting the server's provided URL.
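
If you'd rather check from the terminal before opening the browser, a quick optional sanity check (a sketch; the container name matches the one used above) looks like this:

$ docker ps --filter name=ft-app
$ curl -I localhost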
Prometheus Setup

Now that we have set up what we're monitoring, we need to get our monitoring tool itself
up and running, complete with a service file. Prometheus is a pull-based monitoring
system that scrapes metrics from endpoints set up across our system and stores them in a
time-series database, where we can use a web UI and the PromQL language to view
trends in our data. Prometheus provides its own web UI, but we'll also be pairing it with
Grafana later, as well as an alerting system.
Steps in This Video
1. Create a system user for Prometheus:

sudo useradd --no-create-home --shell /bin/false prometheus

2. Create the directories in which we'll be storing our configuration files and libraries:

sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus

3. Set the ownership of the /var/lib/prometheus directory:

sudo chown prometheus:prometheus /var/lib/prometheus

4. Pull down the tar.gz file from the Prometheus downloads page:

cd /tmp/
wget https://github.com/prometheus/prometheus/releases/download/v2.7.1/prometheus-2.7.1.linux-amd64.tar.gz

5. Extract the files, and move into the extracted directory:

tar -xvf prometheus-2.7.1.linux-amd64.tar.gz
cd prometheus-2.7.1.linux-amd64/

6. Move the configuration files and set their owner to the prometheus user:

sudo mv console* /etc/prometheus
sudo mv prometheus.yml /etc/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus

7. Move the binaries and set their owner:

sudo mv prometheus /usr/local/bin/
sudo mv promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool

8. Create the service file:

sudo $EDITOR /etc/systemd/system/prometheus.service

Add:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target

Save and exit.


9. Reload systemd:

sudo systemctl daemon-reload

10. Start Prometheus, and make sure it automatically starts on boot:

sudo systemctl start prometheus
sudo systemctl enable prometheus

11. Visit Prometheus in your web browser at PUBLICIP:9090.
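
Optionally, we can also verify things from the terminal before moving on. This is a quick sketch rather than part of the original steps; promtool was installed alongside the Prometheus binary above:

promtool check config /etc/prometheus/prometheus.yml
sudo systemctl status prometheus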


Alertmanager Setup
Monitoring is never just monitoring. Ideally, we'll be recording all these metrics and
looking for trends so we can make smart decisions and react better when things go
wrong. And once we have an idea of what to look for, we need to make sure we actually
find out when it happens. This is where alerting applications like Prometheus's
standalone Alertmanager come in.

Steps in This Video


1. Create the alertmanager system user:

sudo useradd --no-create-home --shell /bin/false alertmanager

2. Create the /etc/alertmanager directory:

sudo mkdir /etc/alertmanager

3. Download Alertmanager from the Prometheus downloads page:

cd /tmp/
wget https://github.com/prometheus/alertmanager/releases/download/v0.16.1/alertmanager-0.16.1.linux-amd64.tar.gz

4. Extract the files, and move into the extracted directory:

tar -xvf alertmanager-0.16.1.linux-amd64.tar.gz
cd alertmanager-0.16.1.linux-amd64/

5. Move the binaries:

sudo mv alertmanager /usr/local/bin/
sudo mv amtool /usr/local/bin/

6. Set the ownership of the binaries:

sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/amtool

7. Move the configuration file into the /etc/alertmanager directory:

sudo mv alertmanager.yml /etc/alertmanager/

8. Set the ownership of the /etc/alertmanager directory:

sudo chown -R alertmanager:alertmanager /etc/alertmanager/

9. Create the alertmanager.service file for systemd:

sudo $EDITOR /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/etc/alertmanager/
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target

Save and exit.


10. Stop Prometheus, and then update the Prometheus configuration file to use Alertmanager:

sudo systemctl stop prometheus
sudo $EDITOR /etc/prometheus/prometheus.yml

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

11. Reload systemd, and then start the prometheus and alertmanager services:

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl start alertmanager

12. Make sure alertmanager starts on boot:

sudo systemctl enable alertmanager

13. Visit PUBLICIP:9093 in your browser to confirm Alertmanager is working.
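
Alertmanager only routes and delivers alerts; the alerts themselves are defined in Prometheus rule files. As a minimal sketch that is not part of these steps (the rules.yml file name and the InstanceDown alert are hypothetical examples), a rule file is referenced from /etc/prometheus/prometheus.yml like this:

rule_files:
  - "rules.yml"

And /etc/prometheus/rules.yml itself might contain:

groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."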

Grafana Setup
While Prometheus provides us with a web UI to view our metrics and craft charts, the
web UI alone is often not the best solution to visualizing our data. Grafana is a robust
visualization platform that will allow us to better see trends in our metrics and give us
insight into what's going on with our applications and servers. It also lets us use multiple
data sources, not just Prometheus, which gives us a full view of what's happening.

Steps in This Video


1. Install the prerequisite package:

sudo apt-get install libfontconfig

2. Download and install Grafana using the .deb package provided on the Grafana download page:

wget https://dl.grafana.com/oss/release/grafana_5.4.3_amd64.deb
sudo dpkg -i grafana_5.4.3_amd64.deb

3. Ensure Grafana starts at boot:

sudo systemctl enable --now grafana-server

4. Access Grafana's web UI by going to IPADDRESS:3000.

5. Log in with the username admin and the password admin. Reset the password when prompted.
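
If the page doesn't come up, an optional check from the terminal (assuming Grafana's default port) is:

curl http://localhost:3000/api/health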

Add a Data Source

1. Click Add data source on the homepage.


2. Select Prometheus.
3. Set the URL to http://localhost:9090.
4. Click Save & Test.
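
The same data source can also be provisioned from a file instead of through the UI. As a sketch only (the file name is arbitrary), a YAML file dropped into /etc/grafana/provisioning/datasources/ would look like:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true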

Add a Dashboard

1. From the left menu, return to the Home screen.


2. Click New dashboard. The dashboard is automatically created.
3. Click on the gear icon to the upper right.
4. Set the Name of the dashboard to Forethought.
5. Save the changes.
Push or Pull
Within monitoring there is an age-old battle that puts the Vim versus Emacs debate to
shame: whether to use a push-based or a pull-based monitoring solution. And
while Prometheus is a pull-based monitoring system, it's important to know your options
before actually implementing your monitoring. After all, this is a course about
gathering and using your monitoring data, not a course on Prometheus itself.

Pull-Based Monitoring
When using a pull system to monitor our environments and applications, the
monitoring solution itself queries our metrics endpoints, such as the one located
at :3000/metrics on our Playground server. That particular endpoint exposes Grafana's
own metrics, but the output looks much the same regardless of the endpoint.
Pull-based systems allow us to better check the status of our targets, let us run
monitoring from virtually anywhere, and provide us with web endpoints we can check for
our metrics. That said, they are not without their concerns: Since the pull-based system is
doing the scraping, the metrics might not be as "live" as in an event-based push system,
and if you have a particularly complicated network setup, it might be difficult to
grant the monitoring solution access to all the endpoints it needs to reach.
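
For a sense of what the scraper actually sees, a metrics endpoint returns plain text in the Prometheus exposition format, roughly like the following (the values and the http_requests_total metric are illustrative):

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.47
# TYPE http_requests_total counter
http_requests_total{handler="/",code="200"} 1027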

Push-Based Monitoring
Push-based monitoring solutions offload a lot of the "work" from the monitoring platform
to the endpoints themselves: The endpoints are the ones that push their metrics up to
the monitoring application. Push systems are especially useful when you need event-
based monitoring and can't wait the 15 or so seconds between scrapes for data to be
pulled in. They also allow for greater modularity, offloading most of the difficult work to
the clients they serve.
That said, many push-based systems have greater setup requirements and overhead
than pull-based ones, and management is no longer handled through the monitoring
server alone.

Which to Choose
Despite the debate, one system is not necessarily better than the other, and a lot of it
will depend on your individual needs. Not sure which is best for you? I would suggest
taking the time to set a system of either type up on a dev environment and note the pain
points — because anything causing trouble on a test environment is going to cause
bigger problems on production, and those issues will most likely dictate which system
works best for you.
Patterns and Anti-Patterns

Unfortunately for us, there are a lot of ways to do inefficient monitoring. From monitoring
the wrong thing to spending too much time setting up the coolest new monitoring tool,
monitoring can often become a relentless series of broken and screaming alerts for
problems we're not sure how to fix. In this lesson, we'll address some of the most
common monitoring issues and think about how to avoid them.

Thinking It's About the Tools


While finding the right tool is important, a small set of carefully curated
monitoring tools that suit your needs will take you much further than simply using a tool
because you heard it was the best. Never try to force your needs to fit a tool's abilities.

Falling into Cargo Cults


Just because Google does it doesn't mean we should! Just as we need to think about
our needs when we select our tools, we also need to think about our needs when we set
them up. Ask yourself why you're monitoring something the way you are, and consider
how that monitoring affects your alerting. Is the CPU alarm going off because of an
unknown CPU problem, or should the "application spun up too many processes" alarm
be going off instead?

Not Embracing Automation


No one should be manually enrolling their services into Prometheus — or any
monitoring solution! Automating the process of enrollment from the start will allow
monitoring to happen more naturally and prevent tedious, easily forgotten tasks. We
also want to take the time to look at our runbooks and see what problems can have
automated solutions.

Leaving One Person in Charge


Monitoring is something everyone should give at least some thought to, and it
definitely shouldn't be the job of only one person. Instead, monitoring should be
considered from the very start of a project, and any work needed to monitor a service
should be planned alongside the rest of the work.

Service Discovery

We've used a lot of terms interchangeably in this course up until now: client, service,
endpoint, target. All of these are just something we're monitoring, and the process by
which our monitoring system discovers what to monitor is called service discovery.
While we'll be doing it manually throughout this course (since we only have a very
minimal system), in practice we'd want to automate the task with some kind of service
discovery tool.
Tool Options
- Consul
- Zookeeper
- Nerve
- Any service discovery tool native to your existing platform:
  - AWS
  - Azure
  - GCP
  - Kubernetes
  - Marathon
  - ... and more!

Steps in This Video


1. Go to YOURLABSERVER:9090 and navigate to the Targets page, under Status in the main menu; notice how we only have one target.

2. On your server, open the Prometheus configuration file:

$ sudo $EDITOR /etc/prometheus/prometheus.yml

3. Add targets for Grafana and Alertmanager:

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'alertmanager'
    static_configs:
    - targets: ['localhost:9093']

  - job_name: 'grafana'
    static_configs:
    - targets: ['localhost:3000']

Save and exit when done.

4. Restart Prometheus:

$ sudo systemctl restart prometheus

5. Refresh the Targets page on the web UI. All three targets are now available!
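
As a taste of the automation mentioned earlier, Prometheus can also read targets from files it watches instead of hard-coded static_configs. This is just a sketch (the targets directory and file are hypothetical): another job under scrape_configs could use Prometheus's file-based service discovery like so:

  - job_name: 'file-sd-example'
    file_sd_configs:
      - files:
        - '/etc/prometheus/targets/*.json'

A matching /etc/prometheus/targets/example.json could then contain:

[
  {
    "targets": ["localhost:3000"],
    "labels": {"job": "grafana"}
  }
]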

Infrastructure Monitoring

Using the Node Exporter


Right now, our monitoring system only monitors itself, which, while beneficial, is not
much help when it comes to maintaining and monitoring all our systems as a whole.
We instead have to add endpoints that will allow Prometheus to scrape data for our
application, containers, and infrastructure. In this lesson, we'll start with
infrastructure monitoring by introducing Prometheus's Node Exporter. The Node
Exporter exposes system data to Prometheus on a metrics page, with minimal setup on
our part, leaving us to focus on more practical tasks.
Much like with Prometheus and Alertmanager, to add an exporter to our server we need to
do a little bit of legwork.

Steps in This Video


1. Create a system user:

$ sudo useradd --no-create-home --shell /bin/false node_exporter

2. Download the Node Exporter from Prometheus's download page:

$ cd /tmp/
$ wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz

3. Extract its contents; note that the version of the Node Exporter may be different:

$ tar -xvf node_exporter-0.17.0.linux-amd64.tar.gz

4. Move into the newly created directory:

$ cd node_exporter-0.17.0.linux-amd64/

5. Move the provided binary:

$ sudo mv node_exporter /usr/local/bin/

6. Set the ownership:

$ sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

7. Create a systemd service file:

$ sudo vim /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Save and exit when done.


8. Start the Node Exporter:

$ sudo systemctl daemon-reload
$ sudo systemctl start node_exporter

9. Add the endpoint to the Prometheus configuration file:

$ sudo $EDITOR /etc/prometheus/prometheus.yml

  - job_name: 'node_exporter'
    static_configs:
    - targets: ['localhost:9100']

10. Restart Prometheus:

$ sudo systemctl restart prometheus

11. Navigate to the Prometheus web UI. Using the expression editor, search for cpu, meminfo, and related system terms to view the newly added metrics.

12. Search for node_memory_MemFree_bytes in the expression editor; shorten the time span for the graph to about 30 minutes of data.

13. Back on the terminal, download and run stress to cause some memory spikes:

$ sudo apt-get install stress
$ stress -m 2

14. Wait for about one minute, and then view the graph to see the difference in activity.
CPU Metrics
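
CPU data from the Node Exporter comes from /proc/stat and is exposed as node_cpu_seconds_total, a counter broken out per CPU and per mode. As a sketch of a common starting expression (tune the range to your scrape interval), overall CPU utilization can be calculated as:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)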
Memory Metrics

Run stress -m 1 on your server before starting this lesson.


When it comes to looking at our memory metrics, there are a few core metrics we want
to consider. Memory metrics for Prometheus and other monitoring systems are retrieved
through the /proc/meminfo file; in Prometheus in particular, these metrics are prefixed
with node_memory in the expression editor, and quite a number of them exist. However,
of the vast array of memory information we have access to, there are only a few core
metrics we need to concern ourselves with much of the time:

- node_memory_MemTotal_bytes
- node_memory_MemFree_bytes
- node_memory_MemAvailable_bytes
- node_memory_Buffers_bytes
- node_memory_Cached_bytes

Those who do a bit of systems administration, incident response, and the like have
probably used free before to check the memory of a system. The metric expressions
listed above provide us with what is essentially the same data as free but in a time
series where we can witness trends over time or compare memory between multiple
system builds.
node_memory_MemTotal_bytes provides us with the amount of memory on the server as
a whole — in other words, if we have 64 GB of memory, then this would always be 64
GB of memory, until we allocate more. While on its own this is not the most helpful
number, it helps us calculate the amount of in-use memory:
node_memory_MemTotal_bytes - node_memory_MemFree_bytes
Here, node_memory_MemFree_bytes denotes the amount of free memory left on the
system, not including caches and buffers that can be cleared. To see the amount
of available memory, including caches and buffers that can be opened up, we would
use node_memory_MemAvailable_bytes. And if we wanted to see the cache and buffer
data itself, we would use node_memory_Cached_bytes and node_memory_Buffers_bytes,
respectively.
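
Combining only the metrics above, a quick example expression turns this into the percentage of memory currently in use:

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100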
Disk Metrics

Run stress -i 40 on your server before starting this lesson.


Disk metrics are specifically related to the performance of reads and writes to our disks,
and are most commonly pulled from /proc/diskstats. Prefixed with node_disk, these
metrics track both the amount of data being processed during I/O operations and the
amount of time these operations take, among some other features.
The Node Exporter filters out any loopback devices automatically, so when we view our
metric data in the expression editor, we get only the information we need without a lot of
noise. For example, if we run iostat -x on our terminal, we'll receive detailed
information about our xvda device on top of five loop devices.
Now, we can collect information similar to iostat -x itself across a time series via our
expression editor. This includes using irate to view the disk usage of this I/O operation
across our host:
irate(node_disk_io_time_seconds_total[30s])

Additionally, we can use the node_disk_io_time_seconds_total metric alongside
our node_disk_read_time_seconds_total and node_disk_write_time_seconds_total metrics
to calculate the percentage of time spent on each kind of I/O operation:

irate(node_disk_read_time_seconds_total[30s]) / irate(node_disk_io_time_seconds_total[30s])

irate(node_disk_write_time_seconds_total[30s]) / irate(node_disk_io_time_seconds_total[30s])

We're also provided with a gauge-based metric that lets us see how many
I/O operations are in progress at a point in time:
node_disk_io_now

Other metrics include:

- node_disk_read_bytes_total and node_disk_written_bytes_total, which track the number of bytes read or written, respectively
- node_disk_reads_completed_total and node_disk_writes_completed_total, which track the number of completed reads and writes
- node_disk_reads_merged_total and node_disk_writes_merged_total, which track read and write merges
File System Metrics

File system metrics contain information about our mounted file systems. These metrics
are taken from a few different sources, but all use the node_filesystem prefix when we
view them in Prometheus.
Although most of the seven metrics we're provided here are fairly straightforward, there
are some caveats we want to address, the first being the difference
between node_filesystem_avail_bytes and node_filesystem_free_bytes. While for
some systems these two metrics may be the same, on many Unix systems a portion of
the disk is reserved for the root user. In this
case, node_filesystem_free_bytes contains the amount of free space, including the
space reserved for root, while node_filesystem_avail_bytes contains only the space
available to all users.
Let's go ahead and look at the node_filesystem_avail_bytes metric in our expression
editor. Notice how we have a number of file systems mounted that we can view: Our
main xvda disk, the LXC file system for our container, and various temporary file
systems. If we wanted to limit which file systems we view on the graph, we can uncheck
the systems we're not interested in.
The file system collector also supplies us with more labels than we've previously seen.
Labels are the key-value pairs we see in the curly brackets next to the metric. We can
use these to further manipulate our data, as we saw in previous lessons. So, if we
wanted to view only our temporary file systems, we can use:
node_filesystem_avail_bytes{fstype="tmpfs"}

Of course, these features can be used across all metrics and are not just limited to the
file system. Other metrics may also have their own specific labels, much like
the fstype and mountpoint labels here.
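
As a short example that combines these labels with the collector's size metric (node_filesystem_size_bytes), the percentage of space used on the root file system can be expressed as:

(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100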
Networking Metrics

When we discuss network monitoring through the Node Exporter, we're talking about
viewing networking data from a systems administration or engineering viewpoint: The
Node Exporter provides us with networking device information pulled both
from /proc/net/dev and /sys/class/net/INTERFACE, with INTERFACE being the name of
the interface itself, such as eth0. All network metrics are prefixed with node_network.
Should we take a look at node_network in the expression editor, we can see quite a
number of options — many of these are information gauges whose data is pulled from
that /sys/class/net/INTERFACE directory. So, when we look at node_network_dormant,
we're seeing point-in-time data from the /sys/class/net/INTERFACE/dormant file.
But with regard to the metrics the average user will need for day-to-day
monitoring, we really want to look at the metrics prefixed with
either node_network_transmit or node_network_receive, as these contain information
about the amount of data and packets passing through our network interfaces, both
outbound (transmit) and inbound (receive). Specifically, we want to look at
the node_network_receive_bytes_total and node_network_transmit_bytes_total metrics,
because these are what will help us calculate our network bandwidth:

rate(node_network_transmit_bytes_total[30s])
rate(node_network_receive_bytes_total[30s])

The above expressions will show us the 30-second average of bytes either transmitted
or received across our time series, allowing us to see when our network bandwidth has
spiked or dropped.
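
If we'd rather see bandwidth in bits per second and skip the loopback interface, a small variation on the expressions above (illustrative, not from the lesson) is:

rate(node_network_receive_bytes_total{device!="lo"}[30s]) * 8
rate(node_network_transmit_bytes_total{device!="lo"}[30s]) * 8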
Load Metrics

When we talk about load, we're referencing the number of processes waiting to be
served by the CPU. You've probably seen these metrics before: they sit at the
top of any top command's output and are available for us to view in the /proc/loadavg file.
Taken every 1, 5, and 15 minutes, the load average gives us a snapshot of how hard
our system is working. We can view these statistics in Prometheus
at node_load1, node_load5, and node_load15.
That said, load metrics are mostly useless from a monitoring standpoint. What is a
heavy load to one server can be an easy load for another, and beyond looking at any
trends in load in the time series, there is nothing we can alert on here nor any real data
we can extract through queries or any kind of math.

Using cAdvisor to Monitor Containers

Although we have our host monitored for various common metrics at this time, the Node
Exporter doesn't cross the threshold into monitoring our containers. Instead, if we want
to monitor anything we have in Docker, including our application, we need to add a
container monitoring solution.
Lucky for us, Google's cAdvisor is an open-source solution that works out of the box
with most container platforms, including Docker. And once we have cAdvisor installed,
we can see much of the same metrics we see for our host on our container, only these
are provided to us through the prefix container.
cAdvisor also monitors all our containers automatically. That means when we view a
metric, we're seeing it for everything that cAdvisor monitors. Should we want to target
specific containers, we can do so by using the name label, which pulls the container
name from the name it uses in Docker itself.
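
For example, once cAdvisor is being scraped (set up in the steps below), a starting query for the CPU used by the ft-app container we deployed earlier might look like this sketch:

rate(container_cpu_usage_seconds_total{name="ft-app"}[1m])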

Steps in This Video


1. Launch cAdvisor:

$ sudo docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8000:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest

2. List the running containers to confirm it's working:

$ docker ps

3. Update the Prometheus config:

$ sudo $EDITOR /etc/prometheus/prometheus.yml

  - job_name: 'cadvisor'
    static_configs:
    - targets: ['localhost:8000']

4. Restart Prometheus:

$ sudo systemctl restart prometheus
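
After the restart, cadvisor should appear on the Targets page alongside the other jobs. An optional terminal-side check, using the host port published above, is:

$ curl -s localhost:8000/metrics | head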
