Weird FileNotFoundError during single card training #17

@njupopsicle

Hello, I am trying to reproduce your results with a single A40 (48 GB) GPU and Qwen2.5-Coder-1.5B-Instruct. I followed the README to configure my environment and launched training with the script below.

export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_VISIBLE_DEVICES=1
export HYDRA_FULL_ERROR=1

DATA_DIR_PATH=data

RUN_ID=1.5B
GPU_ENV=1GPU
MODEL_ENV=Qwen2.5-Coder-1.5B-Instruct
PROJECT_NAME=SQL-R1
        
LOG_PATH=logs/$PROJECT_NAME
MODEL_PATH=models/$MODEL_ENV
EXPERIMENT_NAME=$GPU_ENV-$MODEL_ENV-$RUN_ID

mkdir -p $LOG_PATH

set -x

python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_DIR_PATH/train.parquet \
    data.val_files=$DATA_DIR_PATH/test.parquet \
    data.train_batch_size=1 \
    data.val_batch_size=1 \
    data.max_prompt_length=4096 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=3e-7 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=1 \
    actor_rollout_ref.actor.ppo_micro_batch_size=1 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.grad_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=80 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=1.1 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=80 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['wandb'] \
    trainer.project_name=$PROJECT_NAME \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.default_local_dir=$LOG_PATH/$EXPERIMENT_NAME \
    trainer.default_hdfs_dir=null \
    trainer.save_freq=100 \
    trainer.test_freq=100 \
    trainer.total_epochs=10 $@ 2>&1 | tee $LOG_PATH/$MODEL_ENV/grpo.log

However, I encountered the error below:

  File "/home/spotteddove/text2sql/sql-r1/verl/trainer/main_ppo.py", line 186, in main_task
    trainer.fit()
  File "/home/spotteddove/text2sql/sql-r1/verl/trainer/ppo/ray_trainer.py", line 600, in fit
    val_metrics = self._validate()
  File "/home/spotteddove/text2sql/sql-r1/verl/trainer/ppo/ray_trainer.py", line 453, in _validate
    reward_tensor = self.val_reward_fn(test_batch)
  File "/home/spotteddove/text2sql/sql-r1/verl/trainer/main_ppo.py", line 76, in __call__
    score = compute_score_fn(solution_str=sequences_str, ground_truth=ground_truth)
  File "/home/spotteddove/text2sql/sql-r1/verl/utils/reward_score/synsql.py", line 166, in compute_score
    exec_status = func_timeout(
  File "/home/spotteddove/miniconda3/envs/rlvr/lib/python3.9/site-packages/func_timeout/dafunc.py", line 108, in func_timeout
    raise_exception(exception)
  File "/home/spotteddove/miniconda3/envs/rlvr/lib/python3.9/site-packages/func_timeout/py3_raise.py", line 7, in raise_exception
    raise exception[0] from None
  File "/home/spotteddove/text2sql/sql-r1/verl/utils/reward_score/exec_eval.py", line 200, in eval_exec_match
    db_paths = [os.path.join(db_dir, basename) for basename in os.listdir(db_dir) if '.sqlite' in basename]
FileNotFoundError: [Errno 2] No such file or directory: "data/NL2SQL/SynSQL-2.5M/databases/the_table's_domain_appears_to_be_related_to_character_progression_or_rankin"

The path in the error contains a single quote (') that is not in the name on disk (the_tables_domain... ---> the_table's_domain...), and this causes the FileNotFoundError. I have confirmed that data/NL2SQL/SynSQL-2.5M/databases/the_tables_domain_appears_to_be_related_to_character_progression_or_rankin exists.
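
In case it helps with reproducing, here is a minimal sketch of how this kind of mismatch could be checked across many database ids at once. The helper name `find_quote_mismatches` and the approach of stripping single quotes are my own assumptions for illustration; they are not part of SQL-R1 or verl, and the list of ids would have to be read from the dataset separately.

```python
import os


def find_quote_mismatches(db_ids, db_root):
    """Return {db_id: on_disk_name} for ids that are missing under db_root
    but match an existing entry once single quotes are removed.

    Hypothetical helper for diagnosis only; not part of SQL-R1.
    """
    existing = set(os.listdir(db_root))
    mismatches = {}
    for db_id in db_ids:
        if db_id in existing:
            continue  # the directory is present under the exact name
        stripped = db_id.replace("'", "")
        if stripped in existing:
            # Same name apart from the quote, so this id would trigger
            # the FileNotFoundError shown above.
            mismatches[db_id] = stripped
    return mismatches
```

For the case above, this would be called with the ids that the reward function receives and `data/NL2SQL/SynSQL-2.5M/databases` as `db_root`. Ids that are missing for any other reason are deliberately not reported.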

Have you encountered this problem while developing SQL-R1? I look forward to hearing from you. Thank you.
