High Celery CPU Usage in taiga-back-async

For some reason there is a Celery task from the Taiga async backend that runs at very high CPU (100%) despite the software not being used. I can’t figure out which Celery task is using so much CPU; when I end the task, the app still seems to function.

Note that this is different from the RabbitMQ Celery issues; the process is only within the Taiga async backend.

I’m self-hosting this. I have somewhere around 10 projects with quite a few subtasks and tasks.

I can’t say exactly when this started, since I only noticed the high CPU usage one day after not using the software for about a week.

How should I move forward here?

Well, I’ve completely reinstalled Taiga from scratch with no data, and the problem persists.

Also, it’s taiga-async, not taiga-back-async.

After commenting out the telemetry task, the heavy-running Taiga task no longer seems to be an issue. It must have been telemetry causing the problem.

Hi @initiativist,

First, welcome to our community and thank you very much for sharing it with us.

I have tried to recreate your scenario and reproduce the issue, with no results. I enabled the telemetry flag in a local Docker deployment, loaded some projects (with related data in tasks/subtasks/issues), and left Taiga running for a while. Here are my metrics:

$ docker stats

CONTAINER ID   NAME                                   CPU %     MEM USAGE / LIMIT   MEM %     NET I/O          BLOCK I/O         PIDS
fcde3991b531   taiga-docker-taiga-db-1                0.00%     24.16MiB / 31GiB    0.08%     325kB / 435kB    12.2MB / 4.96MB   7
c27e92c82564   taiga-docker-taiga-front-1             0.00%     11.41MiB / 31GiB    0.04%     20kB / 142kB     9.61MB / 291kB    13
1316988bbd42   taiga-docker-taiga-protected-1         0.02%     82.61MiB / 31GiB    0.26%     18.3kB / 0B      270kB / 4.1kB     5
fadcd9946627   taiga-docker-taiga-events-rabbitmq-1   1.12%     113.8MiB / 31GiB    0.36%     707kB / 1.77MB   2.79MB / 6.65MB   45
93cd6bfaff99   taiga-docker-taiga-async-rabbitmq-1    1.19%     117.9MiB / 31GiB    0.37%     483kB / 449kB    1.13MB / 6.81MB   45
cbbaf92b83ce   taiga-docker-taiga-back-1              0.02%     298MiB / 31GiB      0.94%     448kB / 415kB    10.5MB / 8.19kB   37
a3f6b5a9ccff   taiga-docker-taiga-events-1            0.00%     31.66MiB / 31GiB    0.10%     18.1kB / 0B      4.9MB / 8.19kB    18
aa95253a31bf   taiga-docker-taiga-gateway-1           0.00%     10.83MiB / 31GiB    0.03%     318kB / 844kB    8.95MB / 4.1kB    13
c13ed8839243   taiga-docker-taiga-async-1             0.19%     305.8MiB / 31GiB    0.96%     444kB / 415kB    2.79MB / 651kB    7

$ watch -n 1 docker top taiga-docker-taiga-async-1 -ef

Every 1,0s: docker top taiga-docker-taiga-async-1 -ef

UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
matterm+            455546              455518              0                   09:04               ?                   00:00:03            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
matterm+            457089              455546              0                   09:04               ?                   00:00:01            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
matterm+            457103              455546              0                   09:04               ?                   00:00:00            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
matterm+            457115              455546              0                   09:04               ?                   00:00:00            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
matterm+            457116              455546              0                   09:04               ?                   00:00:00            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
matterm+            457117              455546              0                   09:04               ?                   00:00:00            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO

If I’m not wrong, you think the conflicting container is taiga-docker-taiga-async-1 (the one with the Celery backend) and not taiga-docker-taiga-events-rabbitmq-1 (the one receiving RabbitMQ messages). Am I right?

Another question, how did you monitor the CPU usage when you got that constant 100% usage?

You also mentioned you commented out the telemetry task. We’re not sure whether you simply set the Docker flag ENABLE_TELEMETRY=False, or changed the method directly in taiga-back/taiga/telemetry/tasks.py:

@app.task
def send_telemetry():

Anyway, it would be a bit strange for the telemetry task to be the cause of the CPU load, as it’s only supposed to be executed once per day. You can check the scheduled time for the Celery telemetry task in taiga-back/taiga/celery.py:

if settings.ENABLE_TELEMETRY:
    rng = random.Random(settings.SECRET_KEY)
    hour = rng.randint(0, 4)
    minute = rng.randint(0, 59)
    app.conf.beat_schedule['send-telemetry-once-a-day'] = {
        'task': 'taiga.telemetry.tasks.send_telemetry',
        'schedule': crontab(minute=minute, hour=hour),
        'args': (),
    }
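
For reference, since the hour and minute are derived deterministically from SECRET_KEY, you can work out when your own instance is supposed to fire the task. A minimal sketch, assuming you substitute your real SECRET_KEY for the placeholder:

# Reproduces the schedule computed above; "<your SECRET_KEY>" is a placeholder.
import random

rng = random.Random("<your SECRET_KEY>")
hour = rng.randint(0, 4)      # same call order as in taiga/celery.py
minute = rng.randint(0, 59)
print(f"send_telemetry is scheduled daily at {hour:02d}:{minute:02d} (in Celery's configured timezone)")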

We’re the first ones interested in fixing it if it turns out to be a bug, so thank you in advance for any further info you can provide that helps us reproduce it.

Hi Daniel!

Thanks for your thorough response.

Your reproduction steps and assumptions about my situation are almost entirely correct; let me provide some more information.

I measured the CPU usage with htop, so it was 100% of a single core, not of the entire conceptual CPU. I also measured it in the other ways you indicated above, and the usage does not present differently depending on the measurement method; consistently, it was an entire core. I also specifically used docker top on taiga-docker-taiga-async-1, and it showed the same numbers.

Here’s the output of docker top taiga-docker-taiga-async-1:

UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
999                 2919184             2919150             15                  08:50               ?                   00:00:03            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
999                 2919902             2919184             0                   08:50               ?                   00:00:00            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
999                 2919914             2919184             0                   08:50               ?                   00:00:00            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
999                 2919952             2919184             0                   08:50               ?                   00:00:00            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
999                 2919953             2919184             0                   08:50               ?                   00:00:00            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO
999                 2919954             2919184             99                  08:50               ?                   00:00:22            /opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l INFO

You can see the unfortunate 99 (in the C column) on the last process.

As for what I commented out, it was the contents of the send_telemetry() task function in taiga-back/taiga/telemetry/tasks.py. I had ENABLE_TELEMETRY set to False, and that did not fix the issue. The following commands show “False”, indicating that the flag was correctly passed through into the Docker environment:

host@host# docker compose exec taiga-async /bin/bash
root@container:/taiga-back# echo $ENABLE_TELEMETRY
False
host@host# docker compose exec taiga-back /bin/bash
root@container:/taiga-back# echo $ENABLE_TELEMETRY
False
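
For completeness, a quick check that Django itself picked the flag up, not just the shell environment. This is only a sketch, assuming manage.py works as usual inside /taiga-back and that the setting name matches the env var, as the celery.py snippet above suggests:

# Run inside the container via: python manage.py shell
from django.conf import settings

print(settings.ENABLE_TELEMETRY)        # what Django actually loaded
print(type(settings.ENABLE_TELEMETRY))  # a str "False" would still be truthy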

You are correct that RabbitMQ is not causing the issue. In an oversight, I followed various troubleshooting steps to get RabbitMQ’s CPU load down, and it went down, but obviously that did not fix the issue, as it was in the taiga-async container.

Here’s the output of docker top taiga-docker-taiga-async-rabbitmq-1

UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
100                 2919177             2919098             0                   08:50               ?                   00:00:00            /bin/sh /opt/rabbitmq/sbin/rabbitmq-server
100                 2919523             2919177             0                   08:50               ?                   00:00:00            /usr/local/lib/erlang/erts-12.3.2.1/bin/epmd -daemon
100                 2920508             2919177             2                   08:50               ?                   00:00:07            /usr/local/lib/erlang/erts-12.3.2.1/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -sbwt none -sbwtdcpu none -sbwtdio none -B i -- -root /usr/local/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa -noshell -noinput -s rabbit boot -boot start_sasl -lager crash_log false -lager handlers []
100                 2920534             2920508             0                   08:50               ?                   00:00:00            erl_child_setup 1073741816
100                 2920836             2920534             0                   08:50               ?                   00:00:00            inet_gethost 4
100                 2920837             2920836             0                   08:50               ?                   00:00:00            inet_gethost 4

And the same for docker top taiga-docker-taiga-events-rabbitmq-1

UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
100                 2918554             2918482             0                   08:50               ?                   00:00:00            /bin/sh /opt/rabbitmq/sbin/rabbitmq-server
100                 2919456             2918554             0                   08:50               ?                   00:00:00            /usr/local/lib/erlang/erts-12.3.2.1/bin/epmd -daemon
100                 2920405             2918554             1                   08:50               ?                   00:00:07            /usr/local/lib/erlang/erts-12.3.2.1/bin/beam.smp -W w -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -sbwt none -sbwtdcpu none -sbwtdio none -B i -- -root /usr/local/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa -noshell -noinput -s rabbit boot -boot start_sasl -lager crash_log false -lager handlers []
100                 2920455             2920405             0                   08:50               ?                   00:00:00            erl_child_setup 1073741816
100                 2920732             2920455             0                   08:50               ?                   00:00:00            inet_gethost 4
100                 2920733             2920732             0                   08:50               ?                   00:00:00            inet_gethost 4

I won’t be able to troubleshoot for a while; however, after some digging, I suspect the issue arose when I changed the DNS settings of my host machine, which then got copied into the container on startup. A lot of the services I’m hosting were affected by the DNS change, and for several of them telemetry was the culprit, as it’s one of the only outward-facing functions in many of my services.

Brief troubleshooting when I had the issue showed me that just adding public DNS resolvers (Google’s 8.8.8.8 among them) as follows:

dns:
  - 8.8.8.8
  - 114.114.114.114

to the docker-compose.yml file under taiga-async did not fix the high CPU issue, so perhaps I needed to add IPv6 resolution, or maybe it’s stuck in some other way. If it was a connectivity issue, it may be reproducible by setting an iptables rule against the container IP. I’ll have to check that at another time.
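
As a rough way to test the DNS theory from inside the container, something like this would at least show whether name resolution hangs or fails outright. The hostname is a placeholder; I haven’t checked which host the telemetry client actually contacts:

# Times a DNS lookup from inside the taiga-async container.
# "telemetry.example.com" is a placeholder hostname, not the real endpoint.
import socket
import time

host = "telemetry.example.com"
start = time.monotonic()
try:
    socket.getaddrinfo(host, 443)
    print(f"resolved {host} in {time.monotonic() - start:.2f}s")
except socket.gaierror as exc:
    print(f"resolution of {host} failed after {time.monotonic() - start:.2f}s: {exc}")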

The issue is certainly still present; only now, instead of commenting out the code, I just end the task when I start the Compose project. Perhaps I will create a source volume mount and play around with bad internet. It’s always fun to dive into new codebases.

On a side note, what software are you using for this community forum? I quite like it!

Hi @initiativist

On a side note, what software are you using for this community forum? I quite like it!

That question is easier to answer. It’s Discourse, and we’re very glad you like it!

In an oversight, I followed various troubleshooting steps to get RabbitMQ’s CPU load down, and it went down

That sounds very interesting! Did you finally manage to solve it?
I’m also investigating some CPU peaks in the RabbitMQ services, and your conclusions would be very welcome in this other thread: Unexpected CPU usage by RabbitMQ (beam.smp) under no load - #16 by viamonster

Perhaps I will create a source volume mount and play around with bad internet. It’s always fun to dive into new codebases.

Yes, it sounds like a badly handled connection problem when trying to send the telemetry data, probably caused by the DNS resolution. I would take a look at the send_telemetry task and the rudder_analytics library.

Don’t forget to share any progress in your research!

Hi Daniel,

Discourse is quite cool; I’m going to look into it for my own purposes.

Most of the RabbitMQ steps I followed are outlined in the link you provided, so there’s no new information there.

After a significant amount of troubleshooting, I think I’ve found the issue relating to high CPU usage.

It wasn’t the telemetry after all, which, as you indicated, was improbable. The reason commenting it out seemed to stop the issue was that I neglected to put pass in the emptied function, so the worker was failing to run properly in the first place. Ah well.
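
For anyone trying the same thing: an emptied-out task body needs at least a pass (or a docstring) to stay importable. A corrected no-op stub, assuming tasks.py imports the Celery app from taiga.celery as the snippets above suggest, would be:

# A valid no-op stub for taiga/telemetry/tasks.py.
from taiga.celery import app  # import path inferred from taiga/celery.py quoted above

@app.task
def send_telemetry():
    pass  # without this, the module fails to import and the worker breaks before running anything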

As for the actual solution to the issue:

I started by logging all the task executions and monitoring with Flower. I quickly found that none of the native tasks were to blame for the high CPU usage.
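
The per-task logging was nothing fancy; roughly this kind of thing, a minimal sketch using Celery’s standard task_prerun/task_postrun signals rather than my exact code:

# Logs every task start/finish in the worker; lives in any module the worker imports.
from celery.signals import task_prerun, task_postrun

@task_prerun.connect
def log_task_start(task_id=None, task=None, **kwargs):
    print(f"task started: {task.name} ({task_id})")

@task_postrun.connect
def log_task_end(task_id=None, task=None, state=None, **kwargs):
    print(f"task finished: {task.name} ({task_id}) -> {state}")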

I used py-spy to record the time spent in each function and found close_open_fds() to be taking up 100% of the CPU time.

See:


[user@host taiga-docker]# py-spy top --pid <pid>                                                                              
                                                                                                                                                                                    
Collecting samples from '/opt/venv/bin/python /opt/venv/bin/celery -A taiga.celery worker -B --concurrency 4 -l DEBUG -E' (python v3.11.4)                                          
Total Samples 1900                                                                                                                                                                  
GIL: 60.00%, Active: 100.00%, Threads: 1                                                                                                                                            
                                                                                                                                                                                    
  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                                                                            
100.00% 100.00%   19.00s    19.00s   close_open_fds (billiard/compat.py)                                                                                                            
  0.00% 100.00%   0.000s    19.00s   start (celery/worker/worker.py)                                                                                                                
  0.00% 100.00%   0.000s    19.00s   run (celery/beat.py)                                                                                                                           
  0.00% 100.00%   0.000s    19.00s   main (click/core.py)                                                                                                                           
  0.00% 100.00%   0.000s    19.00s   _Popen (billiard/context.py)                                                                                                                   
  0.00% 100.00%   0.000s    19.00s   __init__ (billiard/popen_fork.py)                                                                                                              
  0.00% 100.00%   0.000s    19.00s   start (celery/bootsteps.py)                                                                                                                    
  0.00% 100.00%   0.000s    19.00s   _launch (billiard/popen_fork.py)                                                                                                               
  0.00% 100.00%   0.000s    19.00s   main (celery/bin/celery.py)                                                                                                                    
  0.00% 100.00%   0.000s    19.00s   main (celery/__main__.py)                                                                                                                      
  0.00% 100.00%   0.000s    19.00s   invoke (click/core.py)                                                                                                                         
  0.00% 100.00%   0.000s    19.00s   __call__ (click/core.py)                                                                                                                       
  0.00% 100.00%   0.000s    19.00s   caller (celery/bin/base.py)                                                                                                                    
  0.00% 100.00%   0.000s    19.00s   <module> (celery)                                                                                                                              
  0.00% 100.00%   0.000s    19.00s   start (billiard/process.py)                                                                                                                    
  0.00% 100.00%   0.000s    19.00s   worker (celery/bin/worker.py)                                                                                                                  
  0.00% 100.00%   0.000s    19.00s   _bootstrap (billiard/process.py)                                                                                                               
  0.00% 100.00%   0.000s    19.00s   new_func (click/decorators.py)  
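
One detail that makes this plausible: as far as I can tell from billiard/compat.py, close_open_fds can end up iterating over every possible descriptor up to the process’s open-file limit, and the erl_child_setup 1073741816 lines earlier in the thread suggest these containers run with an enormous file-descriptor limit. A quick sketch to check that limit from inside the taiga-async container:

# Prints the file-descriptor limits the Celery worker sees; a value in the
# billions would explain a core pegged at 100% while forking child processes.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE: soft={soft} hard={hard}")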

That led me to this bug:

The temporary solution I adopted was to alter taiga-back/docker/async_entrypoint.sh as follows:

#!/usr/bin/env bash

# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
#
# Copyright (c) 2021-present Kaleidos Ventures SL

set -euo pipefail

# Give permission to taiga:taiga after mounting volumes
echo "Give permission to taiga:taiga"
chown -R taiga:taiga /taiga-back

# Start Celery: run the worker (without -B) in the background and beat as its own process
echo "Starting Celery..."
gosu taiga celery -A taiga.celery worker \
    --concurrency 4 \
    -l DEBUG \
    -E \
    "$@" &

exec gosu taiga celery -A taiga.celery beat -l DEBUG

Certainly the -l DEBUG and -E flags aren’t necessary; the important change is just the separation of beat from the workers.

I hope this is helpful for you and your team.