Weird FileNotFoundError during single card training #17

Hello, I am trying to reproduce your results on a single `a40-48g` GPU with `Qwen2.5-Coder-1.5B-Instruct`. I followed the README to configure my environment and launched training with the script below.

```bash
export VLLM_ATTENTION_BACKEND=XFORMERS
export CUDA_VISIBLE_DEVICES=1
export HYDRA_FULL_ERROR=1

DATA_DIR_PATH=data

RUN_ID=1.5B
GPU_ENV=1GPU
MODEL_ENV=Qwen2.5-Coder-1.5B-Instruct
PROJECT_NAME=SQL-R1
        
LOG_PATH=logs/$PROJECT_NAME
MODEL_PATH=models/$MODEL_ENV
EXPERIMENT_NAME=$GPU_ENV-$MODEL_ENV-$RUN_ID

mkdir -p $LOG_PATH

set -x

python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_DIR_PATH/train.parquet \
    data.val_files=$DATA_DIR_PATH/test.parquet \
    data.train_batch_size=1 \
    data.val_batch_size=1 \
    data.max_prompt_length=4096 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=3e-7 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=1 \
    actor_rollout_ref.actor.ppo_micro_batch_size=1 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.grad_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=80 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=1.1 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=80 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['wandb'] \
    trainer.project_name=$PROJECT_NAME \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=1 \
    trainer.default_local_dir=$LOG_PATH/$EXPERIMENT_NAME \
    trainer.default_hdfs_dir=null \
    trainer.save_freq=100 \
    trainer.test_freq=100 \
    trainer.total_epochs=10 $@ 2>&1 | tee $LOG_PATH/$MODEL_ENV/grpo.log
```

However, I encountered the error below during validation:

```text
  File "/home/spotteddove/text2sql/sql-r1/verl/trainer/main_ppo.py", line 186, in main_task
    trainer.fit()
  File "/home/spotteddove/text2sql/sql-r1/verl/trainer/ppo/ray_trainer.py", line 600, in fit
    val_metrics = self._validate()
  File "/home/spotteddove/text2sql/sql-r1/verl/trainer/ppo/ray_trainer.py", line 453, in _validate
    reward_tensor = self.val_reward_fn(test_batch)
  File "/home/spotteddove/text2sql/sql-r1/verl/trainer/main_ppo.py", line 76, in __call__
    score = compute_score_fn(solution_str=sequences_str, ground_truth=ground_truth)
  File "/home/spotteddove/text2sql/sql-r1/verl/utils/reward_score/synsql.py", line 166, in compute_score
    exec_status = func_timeout(
  File "/home/spotteddove/miniconda3/envs/rlvr/lib/python3.9/site-packages/func_timeout/dafunc.py", line 108, in func_timeout
    raise_exception(exception)
  File "/home/spotteddove/miniconda3/envs/rlvr/lib/python3.9/site-packages/func_timeout/py3_raise.py", line 7, in raise_exception
    raise exception[0] from None
  File "/home/spotteddove/text2sql/sql-r1/verl/utils/reward_score/exec_eval.py", line 200, in eval_exec_match
    db_paths = [os.path.join(db_dir, basename) for basename in os.listdir(db_dir) if '.sqlite' in basename]
FileNotFoundError: [Errno 2] No such file or directory: "data/NL2SQL/SynSQL-2.5M/databases/the_table's_domain_appears_to_be_related_to_character_progression_or_rankin"
```

The path in the error message contains **a single quote** (') that is not in the real name (the_tables_domain... ---> **the_table's_domain...**), and this causes the FileNotFoundError. I have confirmed that `data/NL2SQL/SynSQL-2.5M/databases/the_tables_domain_appears_to_be_related_to_character_progression_or_rankin` exists.
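For anyone who hits the same mismatch, here is a minimal sketch of a possible workaround. It assumes the only difference between the database id in the data and the directory on disk is the apostrophe. `resolve_db_dir` is a hypothetical helper I wrote for illustration; it is not part of SQL-R1 or verl, and I have not checked where the repository actually builds `db_dir`.

```python
import os
import tempfile


def resolve_db_dir(db_root: str, db_id: str) -> str:
    """Return an existing database directory for db_id.

    Hypothetical helper (not part of SQL-R1 or verl). It tries the id
    exactly as given, then the id with apostrophes removed, and returns
    the first candidate that exists as a directory.
    """
    candidates = [db_id, db_id.replace("'", "")]
    for name in candidates:
        path = os.path.join(db_root, name)
        if os.path.isdir(path):
            return path
    raise FileNotFoundError(
        f"No database directory for {db_id!r} under {db_root!r}; tried {candidates}"
    )


if __name__ == "__main__":
    # Self-contained demo in a temporary directory (no real dataset needed).
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "the_tables_domain"))
        # The id carries an apostrophe, the directory on disk does not.
        print(resolve_db_dir(root, "the_table's_domain"))
```

If the cause turns out to be different (for example, the directory names were sanitized in some other way when the dataset was unpacked), the fallback rule would need to change accordingly.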

Have you encountered this problem while developing SQL-R1? I look forward to hearing from you. Thank you.


