StableDiffusion大模型(Dreambooth)云上训练以及安装CODA指定版本

阅读次数:

创建阿里云PAI DSW实例跑kohya

镜像我这里选:
2025-03-06-13-51-58

dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai/modelscope:1.10.0-pytorch2.1.0tensorflow2.14.0-gpu-py310-cu118-ubuntu22.04
我这里用默认的镜像,实测截至2024.01.10时,直接拉kohya的github可以直接用,不需要改动cuda之类的操作

安装kohya_ss

kohya_ss仓库地址:
https://github.com/bmaltais/kohya_ss
kohya_ss是个webui训练器,SD web_ui里也有对应的Dreambooth训练插件移植,如果只拿来训练不需要跑完整的sd webui服务,只需要kohya就可以了,kohay也可以练lora
在workspace根目录直接:

git clone https://github.com/bmaltais/kohya_ss.git

2025-03-06-13-52-26

很快就能完成,接着依次执行

cd ./kohya_ss
apt update -y && apt install -y python3-tk
chmod +x ./setup.sh
./setup.sh

虽然镜像里带py310,但是似乎还是要装一下python3-tk
之后安装脚本会自动完成,我大概花了5分钟
2025-03-06-13-52-39

然后运行启动webui

HF_ENDPOINT=https://hf-mirror.com ./gui.sh --headless

2025-03-06-13-52-50

点击这个本地的IP即能点开webui了
HF_ENDPOINT=https://hf-mirror.com是为了防止抱脸会更新卡住而用的镜像网站(我确实因为这个卡过)或者见本站另一篇专门说代理的文章:
https://winotmk.github.io/240109_Linux%E4%B8%8A%E7%9A%84%E5%91%BD%E4%BB%A4%E8%A1%8C%E4%BB%A3%E7%90%86%E5%B7%A5%E5%85%B7/

kohya_ss的Dreambooth训练参数

source model

2025-03-06-13-52-58

我这里用自己上传的模型,可以先上传至阿里云oss再挂载进来,所以这里这样选,然后填模型路径就好了

floders

这个tag里比较简单,没什么好说的
Image folder里写上训练集目录,注意写上的目录底下应该是例如10_ABC目录,然后再放图和txt文件,这个10就step的10,和lora训练时候一样

parameters

2025-03-06-13-53-08

basic

和lora训练设置大同小异,但是参数要比lora小得多,因为Dreambooth比lora性能消耗要大得多而且非常容易过拟合,图出来一滩浆糊,比如我尝试epoch可能10以内就足够,由于文件比较大Save every N epochs我一般也就3-4,其他参数看个人需求吧

samples

这里能填关键词和每多少轮出个预览图,玩玩用
都准备好了就可以点 start training,但webui不会有任何提示..要看之前启webui的终端
这样就是开始训练了:
2025-03-06-13-53-30

不过我第一次成功启动了webui但是点开始训练以后,报过类似这样的错:

The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-11 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
The following directories listed in your path were found to be non-existent: {PosixPath('//license-pai.cn-shanghai.data.aliyun.com'), PosixPath('http')}
The following directories listed in your path were found to be non-existent: {PosixPath('dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/cloud-dsw/eas-service'), PosixPath('aigc-torch113-cu117-ubuntu22.04-v0.2.1')}
The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('8088/dsw-301739'), PosixPath('//127.0.0.1')}
The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')}
The following directories listed in your path were found to be non-existent: {PosixPath('tcp'), PosixPath('443'), PosixPath('//10.192.0.1')}
The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//dsw-cn-shanghai.data.aliyun.com')}
The following directories listed in your path were found to be non-existent: {PosixPath('tcp'), PosixPath('443'), PosixPath('//10.192.0.1')}
The following directories listed in your path were found to be non-existent: {PosixPath('/home/pai/bin/python')}
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 7.0.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
/usr/local/lib/python3.10/dist-packages/bitsandbytes/cuda_setup/main.py:166: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!                     If you run into issues with 8-bit matmul, you can try 4-bit quantization: https://huggingface.co/blog/4bit-transformers-bitsandbytes
  warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...
libcusparse.so.11: cannot open shared object file: No such file or directory
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x_nomatmul
python setup.py install
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 146, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py", line 6, in <module>
    from . import cuda_setup, utils, research
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/research/__init__.py", line 1, in <module>
    from . import nn
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
    from .modules import LinearFP8Mixed, LinearFP8Global
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
    from bitsandbytes.optim import GlobalOptimManager
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
    from bitsandbytes.cextension import COMPILED_WITH_CUDA
  File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py", line 20, in <module>
    raise RuntimeError('''

以及如果遇到类似:

Could not load dynamic library 'libcudart.so.11.0'

重新装适合的CUDA版本即可解决,如果要装CUDA:

CUDA相关

安装CUDA指定版本

遇到过cuda版本不匹配的问题,记一下配置过程
cuda下载:https://developer.nvidia.com/cuda-downloads
但是有时候需要特定版本:https://developer.nvidia.com/cuda-toolkit-archive
以11.8为例,系统是ubuntu22.04,所以这样选:
2025-03-06-13-53-49

下载安装cuda:

cd /mnt/workspace
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run

大约3-4GB左右,运行后需要等一会,会弹出交互界面
2025-03-06-13-54-00

这里去掉安装驱动,因为我们已经有驱动了只是想要不同版本的cuda,然后选安装

ps如果遇到装了多份驱动需要卸一个的情况:
https://www.jianshu.com/p/54928967e417

装好以后他会提示:
2025-03-06-13-54-16

需要往 LD_LIBRARY_PATHPATH 里添加两条环境变量

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64
export PATH=$PATH:/usr/local/cuda-11.8/bin

之后我使用

python -m bitsandbytes

如果没有报错应该就是好用的

ps如果是windows上的wsl:

export LD_LIBRARY_PATH=/usr/lib/wsl/lib/

切换cuda版本

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.7
export BNB_CUDA_VERSION=117

改环境变量可以手动切换版本(当然得已经装了)

查看cuda版本

nvidia-smi

或者可以:
参考:https://blog.csdn.net/Kefenggewu_/article/details/117675079
默认cuda会装在/usr/local,所以查看安装版本可以这样:

ls -l /usr/local | grep cuda

或者据说可以nvcc -V # (V大写)

本节另外的参考:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md

链接

kohya_ss:
https://github.com/bmaltais/kohya_ss
一个封装好的kohya-docker的镜像
https://github.com/ashleykleynhans/kohya-docker

dreambooth相关介绍:
https://huggingface.co/docs/diffusers/training/dreambooth
https://github.com/google/dreambooth

|
Built with Hugo
Theme Stack designed by Jimmy Customized and extended by Winotmk