[OSS/S3] Object Storage FAQ
1 Overview: Object Storage FAQ
2 FAQ for Object Storage (OSS/S3)
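Q: How do Volcengine (TOS) clients differ in how they access the service, and what do working examples look like?
All clients: support both intranet and internet access
endpoint (S3 access vs. non-S3 access; the intranet and internet endpoints are entirely different)
Intranet: "tos-s3-ap-southeast-3.ivolces.com:443" or "tos-s3-ap-southeast-3.ivolces.com" / ...
Internet: "tos-s3-ap-southeast-3.volces.com:443" or "tos-s3-ap-southeast-3.volces.com" / ... (access over this route is slower)
See: Regions and Endpoints - Volcengine/TOS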
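The endpoint naming pattern above can be captured in a small helper. This is only a sketch: the function name `tos_endpoint` and its parameters are made up for illustration, and the pattern is inferred from the example host names in this article (`tos-s3-` vs. `tos-` prefix, `ivolces.com` vs. `volces.com` domain). Check the official endpoint table before relying on it for other regions.

```python
def tos_endpoint(region: str, s3_compatible: bool = True, internal: bool = False) -> str:
    """Build a TOS endpoint host name from the pattern seen in this article.

    Hypothetical helper: the pattern is inferred from the examples, not taken from an SDK.
    """
    if not region:
        raise ValueError("region must be a non-empty string")
    # S3-compatible endpoints carry the "tos-s3" prefix, native TOS endpoints plain "tos".
    prefix = "tos-s3" if s3_compatible else "tos"
    # Intranet endpoints use the "ivolces.com" domain, internet endpoints "volces.com".
    domain = "ivolces.com" if internal else "volces.com"
    return f"{prefix}-{region}.{domain}"


# S3-compatible intranet endpoint (for S3 clients such as the AWS Java SDK or pyarrow)
print(tos_endpoint("ap-southeast-3", s3_compatible=True, internal=True))
# Native internet endpoint (for the tos Python SDK)
print(tos_endpoint("ap-southeast-3", s3_compatible=False, internal=False))
```

Keeping the two switches explicit makes it harder to hand an S3 endpoint to a client that expects the native one, or the other way round.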
java: com.amazonaws:aws-java-sdk-*(s3/kms/core/...):1.12.261: supports the S3 protocol; supports VirtualHost-style access, but PathStyle access does not work against TOS
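import com.amazonaws.ClientConfiguration;
import com.amazonaws.Protocol;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// accessKey, secretKey, endpoint, region, bucketName, objectKey, file and FILE_TYPE are defined elsewhere
// Build credentials from the AK/SK pair
com.amazonaws.auth.BasicAWSCredentials ossCredentials = new BasicAWSCredentials(accessKey, secretKey);
ClientConfiguration clientConfiguration = new ClientConfiguration();
clientConfiguration.setConnectionTimeout(9999 * 1000);
clientConfiguration.setRequestTimeout(9999 * 1000);
clientConfiguration.setSocketTimeout(9999 * 1000);
clientConfiguration.setProtocol(Protocol.HTTPS);
// Create the client instance
com.amazonaws.services.s3.AmazonS3 ossClient = AmazonS3ClientBuilder.standard()
    .withCredentials(new AWSStaticCredentialsProvider(ossCredentials))
    .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(endpoint, region))
    .withClientConfiguration(clientConfiguration)
    //.enablePathStyleAccess()
    .withPathStyleAccessEnabled(false) // Huawei Cloud OBS supports PathStyle access; Volcengine (TOS) does not
    .build();
// Operation: upload a file to OSS
PutObjectRequest putObjectRequest = new PutObjectRequest(bucketName, objectKey, file); // String bucketName, String key, File file
String name = file.getName();
ObjectMetadata metadata = new ObjectMetadata();
metadata.addUserMetadata(FILE_TYPE, name.substring(name.lastIndexOf(".") + 1)); // record the file type (extension)
putObjectRequest.setMetadata(metadata);
com.amazonaws.services.s3.model.PutObjectResult result = ossClient.putObject(putObjectRequest);
// Operation: list the objects under a given bucket and folder prefix
ListObjectsRequest lor = new ListObjectsRequest().withBucketName(bucketName).withPrefix("perfFile/");
ObjectListing objectListing = ossClient.listObjects(lor);
// Iterate over the listed objects; objectSummary.getKey() returns the full object path,
// which together with the bucket name is enough to operate on the object
for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
    //...
    String listedKey = objectSummary.getKey();
}
// Operation: delete an object from OSS
ossClient.deleteObject(bucketName, objectKey); // String bucketName, String key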
endpoint (S3 access vs. non-S3 access; the intranet and internet endpoints are entirely different)
Intranet: "tos-s3-ap-southeast-3.ivolces.com:443" or "tos-s3-ap-southeast-3.ivolces.com" / ...
Internet: "tos-s3-ap-southeast-3.volces.com:443" or "tos-s3-ap-southeast-3.volces.com" / ...
See: Regions and Endpoints - Volcengine/TOS
region: "ap-southeast-3" / "cn-beijing" / "cn-shanghai" / ...
python: tos.TosClientV2: does not support the S3 protocol (it works only with the native, non-S3 endpoint)
- Installation:
pip3 install tos
- Volcengine TOS SDK for Python | https://pypi.org/project/tos/
- Volcengine TOS Python SDK | https://github.com/volcengine/ve-tos-python-sdk/blob/main/README-zh.md
- Sample code
import os
import shutil
import traceback

# from obs import ObsClient
import tos
from tos import TosClientV2 as TosClient

""" Local test script
Regions and Endpoints | https://www.volcengine.com/docs/6349/107356?lang=zh
Volcengine TOS SDK for Python | https://pypi.org/project/tos/
pip install tos
Volcengine TOS Python SDK | https://github.com/volcengine/ve-tos-python-sdk/blob/main/README-zh.md
"""
AccessKey = 'xxx'
SecretKey = 'xxx'
# Endpoint='obs.cn-north-4.myhuaweicloud.com'
# Note: on Volcengine, the python tos module only supports the non-S3 endpoint; the java client supports the S3 endpoint
# Intranet (non-S3); the S3 intranet endpoint is tos-s3-ap-southeast-3.ivolces.com and is not for this SDK
# Endpoint='tos-ap-southeast-3.ivolces.com'
# Internet (non-S3)
Endpoint = 'tos-ap-southeast-3.volces.com'
Region = "ap-southeast-3"
BucketName = "xxxEnv-tos-bigdata-private"
# Note: do not put a leading '/' (root prefix) in front of the object directory
OssDirectory = "XXXEnv/ODS/xxx/xxx/"
# LocalSourcePath = "/data/xxxx/local_data/source/xxxDayNumber/xxxDeviceIdHash"
LocalSourcePath = "E:/tmp/xxx-platform" + "/data/xxxx/local_data/source/normal_data/xxxDayNumber/xxxDeviceIdHash"
# or: "E:\\tmp\\" + ...
# "obs://xxxEnv-tos-bigdata-private/XXXEnv/ODS/xxx/xxx/xxxDayNumber/XxxDeviceId/e7463ffdce5ced753a1d77c246fae273-0"
# ObjectKey = "xxxEnv/ODS/xxxxx/xxxDayNumber/xxxDeviceModelCode/XxxDeviceId/e7463ffdce5ced753a1d77c246fae273-0"
ObjectKey = "xxxEnv/ODS/xxx/xxxDayNumber/xxxDeviceModelCode/XxxDeviceId/e7463ffdce5ced753a1d77c246fae273-0"
# LocalSinkPath = "E:/tmp/xxx-platform" + "/xxx/local_data/sink/normal_data/xxxDayNumber/xxxDeviceIdHash/"
LocalSinkPath = 'E:\\work_data\\xxx\\XxxDeviceId\\e7463ffdce5ced753a1d77c246fae273-0'
# ossClient = ObsClient(access_key_id=AccessKey, secret_access_key=SecretKey, server=Endpoint)
ossClient = TosClient(ak=AccessKey, sk=SecretKey, endpoint=Endpoint, region=Region)

### CASE: batch-download the objects under a given OBS/TOS directory into a local folder
# https://xxxEnv-tos-bigdata-private.tos-s3-ap-southeast-3.ivolces.com/xxxEnv/ODS/xxx/xxxDayNumber/xxxDeviceModelCode/XxxDeviceId/e7463ffdce5ced753a1d77c246fae273-0
# local_path='/data/xxx/xxx/local_data/source/normal_data/xxxDayNumber/xxxDeviceIdHash'
local_path = LocalSourcePath


def exit_when_exception():
    """Print the current traceback, remove the partially downloaded folder, and exit."""
    print(traceback.format_exc())
    # ignore_errors=True: the folder may not exist yet if the failure happened early
    shutil.rmtree(LocalSourcePath, ignore_errors=True)
    os._exit(1)

try:
    # Create the TosClientV2 object; all bucket and object operations go through it
    ossClient = tos.TosClientV2(AccessKey, SecretKey, Endpoint, Region)
    # List all objects under the given prefix, page by page
    truncated = True
    continuation_token = ''
    total_files = 0
    success_files = 0
    failure_files = 0
    while truncated:
        result = ossClient.list_objects_type2(BucketName, prefix=OssDirectory, continuation_token=continuation_token)
        for item in result.contents:
            if item.key.endswith("/"):
                # Directory object, e.g. 'xxxEnv/ODS/xxx/normal_data/xxxDayNumber/xxxDeviceIdHash/': skip it
                continue
            # File object: count it across all pages (the original reset the total on every page)
            total_files += 1
            fileName = os.path.basename(item.key)
            try:
                # Local folders are created as needed based on the path
                file_response = ossClient.get_object_to_file(BucketName, item.key, LocalSourcePath + "/" + fileName)
                success_files += 1
                # print("download succeeded: " + item.key)
            except Exception as e:
                failure_files += 1
                print(f"download file fail! objectKey:{item.key}, exception:{e}")
        truncated = result.is_truncated
        continuation_token = result.next_continuation_token
    # resp = obsClient.downloadFiles(bucketName=BucketName, prefix=OssDirectory, downloadFolder=LocalSourcePath)
    # result = "DownloadFiles summary : total_task:%d, success:%d ,failure:%d" % (resp.total_tasks, resp.successful_tasks, resp.failed_tasks)
    summary = "DownloadFiles summary | total_task:%d, success:%d ,failure:%d" % (total_files, success_files, failure_files)
    print(summary)
except tos.exceptions.TosClientError as e:
    # The operation failed on the client side, usually because of invalid request parameters or a network problem
    print('fail with client error, message:{}, cause: {}'.format(e.message, e.cause))
    exit_when_exception()
except tos.exceptions.TosServerError as e:
    # The operation failed on the server side; the response carries the detailed error information
    print('fail with server error, code: {}'.format(e.code))
    # The request id pinpoints the failing request; keeping it in the logs is strongly recommended
    print('error with request id: {}'.format(e.request_id))
    print('error with message: {}'.format(e.message))
    print('error with http code: {}'.format(e.status_code))
    print('error with ec: {}'.format(e.ec))
    print('error with request url: {}'.format(e.request_url))
    exit_when_exception()
except Exception as e:
    print('fail with unknown error: {}'.format(e))
    exit_when_exception()
"""
### CASE: upload a local file to a given OBS/TOS directory (the object path is created automatically)
try:
    # resp = ossClient.putFile(BucketName, ObjectKey, LocalSinkPath)
    resp = ossClient.put_object_from_file(BucketName, ObjectKey, LocalSinkPath)
    print(f"resp.response:{resp.resp}")
except Exception:
    import traceback
    print(traceback.format_exc())
    # shutil.rmtree(LocalSourcePath)
    # shutil.rmtree(LocalSinkPath)
    os._exit(1)
"""
python: pyarrow, via the pyarrow.fs.S3FileSystem class: supports the S3 protocol; supports VirtualHost-style access, but PathStyle access does not work against TOS
#!/usr/bin/env python3.9
# -*- coding: utf-8 -*-
from pyarrow import fs
import pyarrow as pa
import pyarrow.dataset  # explicit import so that pa.dataset is available
import pyarrow.parquet as pq

# Read the parquet files under the given directory with `pq.read_table`
# table = pq.read_table('/data/xxx/xxx/local_data/source/normal_data/xxxDayNumber/xxxDeviceIdHash')
table = pq.read_table('E:/work_data/xxx/xxx/xxxDayNumber/xxxDeviceIdHash')
file_options = pa.dataset.ParquetFileFormat().make_write_options(compression='zstd')
s3 = fs.S3FileSystem(
    region="ap-southeast-3",
    endpoint_override='tos-s3-ap-southeast-3.volces.com:443',
    access_key='xxx',
    secret_key='xxx',
    # True selects virtual-hosted addressing (this is the key setting)
    # https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html
    force_virtual_addressing=True,
)

# Check read (list) permission
try:
    # Try to list the contents of the bucket
    file_info = s3.get_file_info(fs.FileSelector("xxxEnv-tos-bigdata-private/xxxxxDir", recursive=False))
    print("✅ Basic connection succeeded: the bucket is reachable.")
except Exception as e:
    print(f"❌ Permission check failed: {e}")

# Check write permission
test_path = "xxxEnv-tos-bigdata-private/test_connection.txt"
try:
    with s3.open_output_stream(test_path) as stream:
        stream.write(b"connection test")
    print(f"✅ Write permission verified: file created at {test_path}")
    # Optional: delete the test file afterwards
    # s3.delete_file(test_path)
except Exception as e:
    print(f"❌ Write failed: check the s3:PutObject permission. Error details:\n{e}")

pa.dataset.write_dataset(
    table,
    base_dir='xxxEnv-tos-bigdata-private/xxxEnv/ODS/xxx/xxxDayNumber',
    partitioning=['key1', 'key2'],
    basename_template='13acbf6ba3fe361bb48f262c3c0148a6-{i}',
    format='parquet',
    max_partitions=1000000,
    existing_data_behavior='overwrite_or_ignore',
    max_open_files=1000000,
    file_options=file_options,
    filesystem=s3,
)
rows = table.num_rows
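To reason about where `write_dataset` puts its files, the sketch below mimics the object keys one would expect from the call above. It is an illustration only: `expected_object_key` is a hypothetical helper, the bucket and values are placeholders, and it assumes the default directory partitioning that pyarrow uses when `partitioning` is a plain list of field names (the hive flavor would produce `key1=<value>/key2=<value>` segments instead). The `{i}` token in `basename_template` is replaced by an incrementing integer.

```python
from typing import Mapping, Sequence


def expected_object_key(
    base_dir: str,
    partition_fields: Sequence[str],
    row: Mapping[str, object],
    basename_template: str,
    index: int = 0,
) -> str:
    """Approximate the key of a written file (assumes directory partitioning).

    Hypothetical helper for illustration; it does not call pyarrow.
    """
    # One path segment per partition field, holding only the value.
    segments = [str(row[field]) for field in partition_fields]
    # "{i}" in the template is filled with the file counter.
    file_name = basename_template.format(i=index)
    return "/".join([base_dir.rstrip("/"), *segments, file_name])


key = expected_object_key(
    base_dir="my-bucket/ods/day",           # placeholder "bucket/prefix" path
    partition_fields=["key1", "key2"],
    row={"key1": "A", "key2": 7},           # placeholder partition values
    basename_template="part-{i}",
)
print(key)  # my-bucket/ods/day/A/7/part-0
```

Note that the template in the script above has no `.parquet` suffix, so under this assumption the written objects carry no file extension either.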
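Q: Volcengine (TOS) does not support PathStyle access and only supports VirtualHost access, which makes Java/Python clients fail with an InvalidPathAccess error. How should the client be configured to fix this?
Problem description
Java and other S3-protocol clients report: InvalidPathAccess
pyarrow reports: AWS Error ACCESS_DENIED during HeadBucket operation: No response body.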
table = pq.read_table('/data/source/normal_data/xxxDayNumber/xxxDeviceIdHash')
file_options = pa.dataset.ParquetFileFormat().make_write_options(compression='zstd')
pa.dataset.write_dataset(table, base_dir='xxxEnv-tos-bigdata-private/xxxEnv/ODS/xxx/xxxDayNumber',partitioning=['key1','key2'],basename_template='13acbf6ba3fe361bb48f262c3c0148a6-{i}',format='parquet',max_partitions=1000000,existing_data_behavior='overwrite_or_ignore',max_open_files=1000000,file_options=file_options,filesystem=s3)
Exception log:
# python3 xxxx.py
...
File "/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py", line 1035, in write_dataset
_filesystemdataset_write(
File "pyarrow/_dataset.pyx", line 4177, in pyarrow._dataset._filesystemdataset_write
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: When testing for existence of bucket 'xxxEnv-tos-bigdata-private': AWS Error ACCESS_DENIED during HeadBucket operation: No response body.
root@xxx-ecs:~# pip list | grep -i pyarrow
pyarrow 23.0.0
Root cause analysis
- Volcengine (TOS): does not support PathStyle access, only VirtualHost access, so Java/Python clients that send PathStyle requests fail with InvalidPathAccess
- Huawei Cloud (OBS): supports PathStyle access
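The difference between the two access modes is only where the bucket name goes in the request URL. The sketch below makes that concrete; `object_url` is a hypothetical helper and `my-bucket` is a placeholder, neither comes from any SDK.

```python
from urllib.parse import quote


def object_url(endpoint: str, bucket: str, key: str, path_style: bool = False) -> str:
    """Return the HTTPS URL of an object in either addressing style.

    Hypothetical helper for illustration only.
    """
    # Object keys carry no leading "/"; percent-encode everything except "/".
    encoded_key = quote(key.lstrip("/"), safe="/")
    if path_style:
        # PathStyle: the bucket is the first path segment (rejected by TOS).
        return f"https://{endpoint}/{bucket}/{encoded_key}"
    # VirtualHost: the bucket is part of the host name (the mode TOS accepts).
    return f"https://{bucket}.{endpoint}/{encoded_key}"


endpoint = "tos-s3-ap-southeast-3.volces.com"
print(object_url(endpoint, "my-bucket", "dir/a.parquet"))
print(object_url(endpoint, "my-bucket", "dir/a.parquet", path_style=True))
```

The first URL has the virtual-hosted shape TOS expects; the second has the path-style shape that the InvalidPathAccess error refers to.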
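Solution
- In this article: see the first question above (client access differences and examples), in particular withPathStyleAccessEnabled(false) for Java and force_virtual_addressing=True for pyarrow
- Official documentation: InvalidPathAccess error when accessing TOS in PathStyle mode - Volcengine/TOS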
Y Recommended reading
- InvalidPathAccess error when accessing TOS in PathStyle mode - Volcengine/TOS
- Regions and Endpoints - Volcengine/TOS
- pyarrow.fs.S3FileSystem - Apache Arrow
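X References
Author:
千千寰宇
Link: https://chuna2.787528.xyz/johnnyzen
About this post: comments and private messages are answered as soon as possible; you can also message me directly.
Copyright: unless otherwise stated, all articles on this blog are licensed under BY-NC-SA. Please credit the source when reposting!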