
I have a Python Celery application that uses Apache Spark for large-scale processing. Everything was working fine until today, when I received:

Exception in thread "main" java.nio.file.NoSuchFileException: /tmp/tmpkqdh2glm/connection16758665990471584352.info

Below is my docker-compose file. I have tried everything, but it seems I am missing something. The failure is also intermittent: it works normally for a while and then goes back to throwing NoSuchFileException. Do you have any hint as to what I am doing wrong?

Both PySpark and the local Spark installation on the Celery machine are version 4.0.0, matching the cluster. The Spark master URL and the driver bindAddress are also set in the app. This setup was working perfectly fine until yesterday.
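
For context, the SparkSession is created inside the Celery task roughly like this (simplified sketch; the app name and option values are illustrative, not the exact ones from my code, but the hostnames match the compose services below):

from pyspark.sql import SparkSession

# Rough sketch of the session setup used inside the Celery task.
# The app name and scratch path are illustrative; the hostnames
# correspond to the docker-compose services below.
spark = (
    SparkSession.builder
    .appName("celery-spark-job")
    .master("spark://spark-master:7077")
    # The driver runs inside the celery-worker-1 container, so executors
    # must be able to reach it under that service name.
    .config("spark.driver.host", "celery-worker-1")
    .config("spark.driver.bindAddress", "0.0.0.0")
    # Scratch directory for driver-side shuffle/temp files.
    .config("spark.local.dir", "/tmp/spark")
    .getOrCreate()
)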

version: '3.9'

services:
  spark-master:
    image: bitnami/spark:4.0.0
    container_name: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_DRIVER_MEMORY=1g
      - SPARK_EXECUTOR_MEMORY=1g
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
      - SPARK_WORKER_LOG_DIR=/app/.logs
      - SPARK_LOCAL_DIRS=/tmp/spark
      - SPARK_WORKER_DIR=/tmp/spark
      - SPARK_MASTER_DIR=/tmp/spark
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - network
    restart: always
    volumes:
      - logs:/app/.logs
      - spark-tmp:/tmp/spark

  spark-worker-1:
    image: bitnami/spark:4.0.0
    container_name: spark-worker-1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1g
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
      - SPARK_WORKER_LOG_DIR=/app/.logs
      - SPARK_LOCAL_DIRS=/tmp/spark
      - SPARK_WORKER_DIR=/tmp/spark
    ports:
      - "8081:8081"
    networks:
      - network
    restart: always
    volumes:
      - logs:/app/.logs
      - spark-tmp:/tmp/spark

  celery-worker-1:
    container_name: celery-worker-1
    image: backend:latest
    command: celery -A utils.celery_utils worker --loglevel=info --concurrency=4
    env_file:
      - .env
    environment:
      - SPARK_DRIVER_HOST=celery-worker-1
      - CELERYD_PREFETCH_MULTIPLIER=1
      - SPARK_LOCAL_DIRS=/tmp/spark
    depends_on:
      - redis
      - spark-master
    networks:
      - network
    restart: always
    volumes:
      - logs:/app/.logs
      - spark-tmp:/tmp/spark

volumes:
  pgdata:
  logs:
  spark-tmp:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs

networks:
  network:
    driver: bridge

So far, I have tried mounting the /tmp/spark directory as a named volume and backing that volume with tmpfs.

I expected Spark to write the data as it always does, but it intermittently crashes with the exception above.

Comments:
  • Please post the full stack trace of the error if you can. It can help people get more context on how your process fails.
  • You can't mount your volume into both the driver and the workers. They need to be separated.
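
If the shared spark-tmp volume is indeed the problem, giving each service its own scratch volume might look roughly like this (an untested sketch in the same compose format; the volume names are made up for illustration):

services:
  spark-master:
    volumes:
      - spark-master-tmp:/tmp/spark    # scratch space for the master only
  spark-worker-1:
    volumes:
      - spark-worker-1-tmp:/tmp/spark  # scratch space for this worker only
  celery-worker-1:
    volumes:
      - celery-driver-tmp:/tmp/spark   # scratch space for the driver only

volumes:
  spark-master-tmp:
  spark-worker-1-tmp:
  celery-driver-tmp:

Each container then has its own scratch space, so one service cleaning up its temporary files cannot delete another service's files out from under it.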
