Sharing the materials I read while studying the PyTorch source code. This article focuses on the design and implementation of PyTorch's core data structure, Tensor. Because the implementation differs across PyTorch versions, the materials are ordered by ascending version number where possible; the version number appears in brackets before each title. For the implementation in the latest version, please check the PyTorch repository. For more, see:

- Ubuntu 22.04 LTS 源码编译安装 PyTorch (building PyTorch from source on Ubuntu 22.04 LTS)
- pytorch/CONTRIBUTING.md (machine translation)
- PyTorch 源码学习：阅读经验 & 代码结构 (reading tips & code layout)

Table of contents:

- Understanding the design of Tensor through class diagrams
- More on c10::intrusive_ptr_target, TensorImpl, and StorageImpl
- A top-down exploration of the Tensor implementation and memory allocation
  - aten/src/ATen/CheckpointTensorImpl.cpp
  - aten/src/ATen/CheckpointTensorImpl.h
  - aten/src/ATen/templates/TensorBody.h
  - c10/core/TensorImpl.h
  - c10/core/TensorImpl.cpp
  - c10/core/Storage.h
  - c10/core/StorageImpl.h
  - c10/core/Allocator.h
  - c10/util/UniqueVoidPtr.h
  - c10/cuda/CUDACachingAllocator.h
  - c10/cuda/CUDACachingAllocator.cpp
    - void* raw_alloc(size_t nbytes);
    - void* raw_alloc_with_stream(size_t nbytes, cudaStream_t stream);
    - raw_delete(void* ptr);
    - void* getBaseAllocation(void *ptr, size_t *size);
- To be continued...

## Understanding the Design of Tensor through Class Diagrams

On class diagrams, see: UML之类图关系（继承、实现、依赖、关联、聚合、组合）- CSDN博客.

The diagram below comes from [1.0.0] PyTorch的Tensor（上） and serves as a simplified class diagram. The post was written early on, but it remains a very useful reference. Other posts in the same series:

- 2019-01-19 PyTorch的编译系统
- 2019-02-14 PyTorch ATen代码的动态生成
- 2019-02-18 PyTorch Autograd代码的动态生成
- 2019-02-27 PyTorch的初始化
- 2019-03-06 PyTorch的Tensor（上）
- 2019-05-11 PyTorch的Tensor（中）
- 2019-06-23 PyTorch的Tensor（下）
- 2019-03-16 PyTorch的cpp代码生成
- 2019-04-22 再谈PyTorch的初始化（上）
- 2019-04-23 再谈PyTorch的初始化（中）
- 2019-04-24 再谈PyTorch的初始化（下）
- 2019-04-30 PyTorch的动态图（上）
- 2019-05-16 PyTorch的动态图（下）

In the diagram: vertical lines mean inheritance, horizontal means containment, and parentheses mark a class.
```
DataPtr - StorageImpl - Storage - (TensorImpl) - (Tensor)
                                       |             |
                                       v             v
                      (Tensor) - Variable::Impl   Variable - AutogradMeta - (TensorImpl)
```

Here both the Storage/StorageImpl pair and the TensorImpl/Tensor pair use the Bridge design pattern.

The Bridge pattern is a structural design pattern that separates an abstraction from its implementation so that the two can vary independently, decoupling the different axes along which a class changes and reducing coupling between classes. It achieves this through composition rather than inheritance.

How Storage and StorageImpl realize the Bridge pattern:

- Abstraction: the Storage class. It provides a high-level interface for operating on and managing data storage, but does not implement the storage details itself.
- Implementor: the StorageImpl class. It defines the concrete storage details, including the data type, data pointer, element count, and so on.
- Composition: Storage holds a smart pointer to StorageImpl, `c10::intrusive_ptr<StorageImpl>`. That is, Storage does not implement data storage itself but delegates to StorageImpl; `storage_impl_` is the bridge through which Storage manipulates the actual data.

Benefits of the Bridge pattern here:

- Interface and implementation are separated: the storage interface (Storage) and the storage implementation (StorageImpl) can evolve independently. For example, the implementation details can change without affecting the interface, and vice versa.
- Flexibility and extensibility: new storage implementations can be added without changing the existing interface, and the interface can be extended without touching the implementations.
- Lower coupling between interface and implementation improves maintainability and testability.

The diagram below comes from [1.10.0] [Pytorch 源码阅读] —— Tensor C++相关实现, which contains more source-level analysis of the c10::intrusive_ptr_target, TensorImpl, and StorageImpl classes. In short:

- Initializing a c10::intrusive_ptr requires an intrusive_ptr_target or one of its subclasses; TensorImpl and StorageImpl are both subclasses of intrusive_ptr_target.
- StorageImpl is responsible for the tensor's actual physical memory: setting the allocator, obtaining the data pointer, reporting the number of bytes occupied, and so on. Storage is a thin wrapper over StorageImpl that forwards directly to StorageImpl's member functions.
- TensorImpl is the main class behind the Tensor implementation, and its initialization depends on Storage. Hence the chain: Tensor → TensorImpl → Storage → StorageImpl.

The diagram below comes from [2.0.0] Tensor的组织结构.

The diagram below comes from [unknown] pytorch源码学习-Tensor-01.

The diagram below comes from [unknown] Pytorch Tensor/TensorImpl/Storage/StorageImpl及相关内容; related posts: pytorch intrusive_ptr, pytorch Device/DeviceType, Pytorch TypeMeta.

Where these classes live:

- Tensor, WeakTensor - aten/src/ATen/core/Tensor.h
- TensorImpl - c10/core/TensorImpl.h
- Storage - c10/core/Storage.h
- StorageImpl - c10/core/StorageImpl.h
- DataPtr, Allocator, AllocatorRegisterer - c10/core/Allocator.h
- UniqueVoidPtr - c10/util/UniqueVoidPtr.h
## More on c10::intrusive_ptr_target, TensorImpl, and StorageImpl

- Tensor源码分析与复现（1）
- Tensor源码分析与复现（2）★★★
- 【翻译】PyTorch中的intrusive_ptr
- pytorch基于intrusive_ptr_target实现的核心数据结构介绍
## A Top-Down Exploration of the Tensor Implementation and Memory Allocation

The content below comes from my research topic during graduate school. The code can be found in the DTR (Dynamic Tensor Rematerialization) fork of PyTorch 1.7.0. We start exploring from the `memory` function in CheckpointTensorImpl.cpp.
### aten/src/ATen/CheckpointTensorImpl.cpp

See: aten/src/ATen/CheckpointTensorImpl.cpp

```cpp
#include <ATen/CheckpointTensorImpl.h>      // -> aten/src/ATen/CheckpointTensorImpl.h
#include <ATen/Logger.h>
#include <c10/cuda/CUDACachingAllocator.h>  // -> c10/cuda/CUDACachingAllocator.h

inline size_t memory(const Tensor& t) {
  if (!t.has_storage()) {
    return 0;
  }
  auto& storage = t.storage();
  size_t res = storage.nbytes();
  memory_sum += res;
  memory_max = std::max(memory_max, res);
  memory_count += 1;
  return res;
}

long current_memory() {
  // Index 0 is StatType::AGGREGATE.
  auto device_stat = c10::cuda::CUDACachingAllocator::getDeviceStats(0);
  return device_stat.allocated_bytes[0].current;
}
```
### aten/src/ATen/CheckpointTensorImpl.h

See: aten/src/ATen/CheckpointTensorImpl.h

```cpp
#include <c10/core/Backend.h>
#include <c10/core/MemoryFormat.h>
#include <c10/core/Storage.h>                // -> c10/core/Storage.h
#include <c10/core/TensorOptions.h>
#include <c10/core/DispatchKeySet.h>
#include <c10/core/impl/LocalDispatchKeySet.h>
#include <c10/core/CopyBytes.h>

#include <c10/util/Exception.h>
#include <c10/util/Optional.h>
#include <c10/util/Flags.h>
#include <c10/util/Logging.h>
#include <c10/util/python_stub.h>

#include <c10/core/TensorImpl.h>             // -> c10/core/TensorImpl.h
#include <ATen/Tensor.h>                     // -> aten/src/ATen/Tensor.h -> aten/src/ATen/templates/TensorBody.h
#include <ATen/ATen.h>                       // -> aten/src/ATen/ATen.h
```
### aten/src/ATen/templates/TensorBody.h

See: aten/src/ATen/templates/TensorBody.h

```cpp
#include <c10/core/Device.h>
#include <c10/core/Layout.h>
#include <c10/core/MemoryFormat.h>
#include <c10/core/QScheme.h>
#include <c10/core/Scalar.h>
#include <c10/core/ScalarType.h>
#include <c10/core/Storage.h>                // -> c10/core/Storage.h
#include <ATen/core/TensorAccessor.h>
#include <c10/core/TensorImpl.h>             // -> c10/core/TensorImpl.h
#include <c10/core/UndefinedTensorImpl.h>
#include <c10/util/Exception.h>
#include <c10/util/Deprecated.h>
#include <c10/util/Optional.h>
#include <c10/util/intrusive_ptr.h>
#include <ATen/core/DeprecatedTypePropertiesRegistry.h>
#include <ATen/core/DeprecatedTypeProperties.h>
#include <ATen/core/NamedTensor.h>
#include <ATen/core/QuantizerBase.h>
#include <torch/csrc/WindowsTorchApiMacro.h>

class CAFFE2_API Tensor {
 public:
  bool defined() const {
    return impl_;
  }

  bool has_storage() const {
    return defined() && impl_->has_storage();
  }

  const Storage& storage() const {
    return impl_->storage();
  }

  void* data_ptr() const {
    return this->unsafeGetTensorImpl()->data();
  }

  template <typename T>
  T* data_ptr() const;

 protected:
  c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl> impl_;
};
```
### c10/core/TensorImpl.h

See: c10/core/TensorImpl.h

```cpp
#include <c10/core/Backend.h>
#include <c10/core/MemoryFormat.h>
#include <c10/core/Storage.h>                // -> c10/core/Storage.h
#include <c10/core/TensorOptions.h>
#include <c10/core/DispatchKeySet.h>
#include <c10/core/impl/LocalDispatchKeySet.h>
#include <c10/core/CopyBytes.h>

#include <c10/util/Exception.h>
#include <c10/util/Optional.h>
#include <c10/util/Flags.h>
#include <c10/util/Logging.h>
#include <c10/util/python_stub.h>

struct C10_API TensorImpl : public c10::intrusive_ptr_target {
 public:
  /**
   * Return a reference to the sizes of this tensor. This reference remains
   * valid as long as the tensor is live and not resized.
   */
  virtual IntArrayRef sizes() const;

  /**
   * True if this tensor has storage. See storage() for details.
   */
  virtual bool has_storage() const;

  /**
   * Return the underlying storage of a Tensor. Multiple tensors may share
   * a single storage. A Storage is an impoverished, Tensor-like class
   * which supports far less operations than Tensor.
   *
   * Avoid using this method if possible; try to use only Tensor APIs to perform
   * operations.
   */
  virtual const Storage& storage() const;

  /**
   * Return the size of a single element of this tensor in bytes.
   */
  size_t itemsize() const {
    TORCH_CHECK(
        dtype_initialized(),
        "Cannot report itemsize of Tensor that doesn't have initialized dtype "
        "(e.g., caffe2::Tensor x(CPU), prior to calling mutable_data<T>() on x)");
    return data_type_.itemsize();
  }

 protected:
  Storage storage_;
};
```

### c10/core/TensorImpl.cpp

See: c10/core/TensorImpl.cpp

```cpp
#include <c10/core/TensorImpl.h>             // -> c10/core/TensorImpl.h

IntArrayRef TensorImpl::sizes() const {
  return sizes_;
}

bool TensorImpl::has_storage() const {
  return storage_;
}

const Storage& TensorImpl::storage() const {
  return storage_;
}
```

### c10/core/Storage.h

See: c10/core/Storage.h

```cpp
#include <c10/core/StorageImpl.h>            // -> c10/core/StorageImpl.h

struct C10_API Storage {
 public:
  size_t nbytes() const {
    return storage_impl_->nbytes();
  }

  // get() use here is to get const-correctness
  void* data() const {
    return storage_impl_.get()->data();
  }

  at::DataPtr& data_ptr() {
    return storage_impl_->data_ptr();
  }

  const at::DataPtr& data_ptr() const {
    return storage_impl_->data_ptr();
  }

  at::Allocator* allocator() const {
    return storage_impl_.get()->allocator();
  }

 protected:
  c10::intrusive_ptr<StorageImpl> storage_impl_;
};
```
### c10/core/StorageImpl.h

See: c10/core/StorageImpl.h

```cpp
#include <c10/core/Allocator.h>              // -> c10/core/Allocator.h
#include <c10/core/ScalarType.h>

#include <c10/util/intrusive_ptr.h>

struct C10_API StorageImpl final : public c10::intrusive_ptr_target {
 public:
  size_t nbytes() const {
    return size_bytes_;
  }

  at::DataPtr& data_ptr() {
    return data_ptr_;
  }

  const at::DataPtr& data_ptr() const {
    return data_ptr_;
  }

  // TODO: Return const ptr eventually if possible
  void* data() {
    return data_ptr_.get();
  }

  void* data() const {
    return data_ptr_.get();
  }

  at::Allocator* allocator() {
    return allocator_;
  }

  const at::Allocator* allocator() const {
    return allocator_;
  }

 private:
  DataPtr data_ptr_;
  size_t size_bytes_;
  Allocator* allocator_;
};
```
### c10/core/Allocator.h

See: c10/core/Allocator.h

```cpp
#include <c10/core/Device.h>
#include <c10/util/Exception.h>
#include <c10/util/ThreadLocalDebugInfo.h>
#include <c10/util/UniqueVoidPtr.h>          // -> c10/util/UniqueVoidPtr.h

class C10_API DataPtr {
 private:
  c10::detail::UniqueVoidPtr ptr_;
  Device device_;

 public:
  void* get() const {
    return ptr_.get();
  }
};

struct C10_API Allocator {
  virtual ~Allocator() = default;
  virtual DataPtr allocate(size_t n) const = 0;
};
```

### c10/util/UniqueVoidPtr.h

See: c10/util/UniqueVoidPtr.h

```cpp
class UniqueVoidPtr {
 private:
  // Lifetime tied to ctx_
  void* data_;
  std::unique_ptr<void, DeleterFnPtr> ctx_;

 public:
  void clear() {
    ctx_ = nullptr;
    data_ = nullptr;
  }

  void* get() const {
    return data_;
  }
};
```
### c10/cuda/CUDACachingAllocator.h

See: c10/cuda/CUDACachingAllocator.h

```cpp
#include <c10/cuda/CUDAStream.h>
#include <c10/core/Allocator.h>              // -> c10/core/Allocator.h
#include <c10/cuda/CUDAMacros.h>
#include <c10/util/Registry.h>

namespace CUDACachingAllocator {

struct Stat {
  int64_t current = 0;
  int64_t peak = 0;
  int64_t allocated = 0;
  int64_t freed = 0;
};

enum struct StatType : uint64_t {
  AGGREGATE = 0,
  SMALL_POOL = 1,
  LARGE_POOL = 2,
  NUM_TYPES = 3  // remember to update this whenever a new stat type is added
};

typedef std::array<Stat, static_cast<size_t>(StatType::NUM_TYPES)> StatArray;

// Struct containing memory allocator summary statistics for a device.
struct DeviceStats {
  // COUNT: allocations requested by client code
  StatArray allocation;
  // COUNT: number of allocated segments from cudaMalloc().
  StatArray segment;
  // COUNT: number of active memory blocks (allocated or used by stream)
  StatArray active;
  // COUNT: number of inactive, split memory blocks (unallocated but can't be released via cudaFree)
  StatArray inactive_split;

  // SUM: bytes requested by client code
  StatArray allocated_bytes;
  // SUM: bytes reserved by this memory allocator (both free and used)
  StatArray reserved_bytes;
  // SUM: bytes within active memory blocks
  StatArray active_bytes;
  // SUM: bytes within inactive, split memory blocks
  StatArray inactive_split_bytes;

  // COUNT: total number of failed calls to CUDA malloc necessitating cache flushes.
  int64_t num_alloc_retries = 0;

  // COUNT: total number of OOMs (i.e. failed calls to CUDA after cache flush)
  int64_t num_ooms = 0;
};

// Struct containing info of an allocation block (i.e. a fractional part of a cudaMalloc).
struct BlockInfo {
  int64_t size = 0;
  bool allocated = false;
  bool active = false;
};

// Struct containing info of a memory segment (i.e. one contiguous cudaMalloc).
struct SegmentInfo {
  int64_t device = 0;
  int64_t address = 0;
  int64_t total_size = 0;
  int64_t allocated_size = 0;
  int64_t active_size = 0;
  bool is_large = false;
  std::vector<BlockInfo> blocks;
};

C10_CUDA_API void* raw_alloc(size_t nbytes);
C10_CUDA_API void* raw_alloc_with_stream(size_t nbytes, cudaStream_t stream);
C10_CUDA_API void raw_delete(void* ptr);

C10_CUDA_API Allocator* get();
C10_CUDA_API void init(int device_count);
C10_CUDA_API void emptyCache();
C10_CUDA_API void cacheInfo(int dev_id, size_t* cachedAndFree, size_t* largestBlock);
C10_CUDA_API void* getBaseAllocation(void* ptr, size_t* size);
C10_CUDA_API void recordStream(const DataPtr&, CUDAStream stream);
C10_CUDA_API DeviceStats getDeviceStats(int device);
C10_CUDA_API void resetAccumulatedStats(int device);
C10_CUDA_API void resetPeakStats(int device);
C10_CUDA_API std::vector<SegmentInfo> snapshot();

C10_CUDA_API std::mutex* getFreeMutex();

C10_CUDA_API std::shared_ptr<void> getIpcDevPtr(std::string handle);

} // namespace CUDACachingAllocator
```
### c10/cuda/CUDACachingAllocator.cpp

See: c10/cuda/CUDACachingAllocator.cpp

```cpp
#include <c10/cuda/CUDACachingAllocator.h>   // -> c10/cuda/CUDACachingAllocator.h

#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAException.h>
#include <c10/cuda/CUDAFunctions.h>
#include <c10/util/UniqueVoidPtr.h>          // -> c10/util/UniqueVoidPtr.h
```

#### void* raw_alloc(size_t nbytes);

```cpp
// Implementation
void* raw_alloc(size_t nbytes) {
  if (nbytes == 0) {
    return nullptr;
  }
  int device;
  C10_CUDA_CHECK(cudaGetDevice(&device));
  void* r = nullptr;
  caching_allocator.malloc(&r, device, nbytes, cuda::getCurrentCUDAStream(device));
  return r;
}
```

---
```cpp
/** Allocates a block which is safe to use from the provided stream.
 *  Member function of THCCachingAllocator.
 *  Called by void* raw_alloc.
 */
void malloc(void** devPtr, int device, size_t size, cudaStream_t stream) {
  TORCH_INTERNAL_ASSERT(
      0 <= device && device < device_allocator.size(),
      "Allocator not initialized for device ",
      device,
      ": did you call init?");
  // Call the per-device allocator, then register the new block
  // via add_allocated_block.
  Block* block = device_allocator[device]->malloc(device, size, stream);
  add_allocated_block(block);
  *devPtr = (void*)block->ptr;
}
```

---

Next comes the per-device allocation: `DeviceCachingAllocator::malloc`, called by `THCCachingAllocator::malloc` above.
```cpp
/**
 * Member function of DeviceCachingAllocator.
 * Called by THCCachingAllocator::malloc.
 */
Block* malloc(int device, size_t size, cudaStream_t stream)
{
  std::unique_lock<std::recursive_mutex> lock(mutex);

  // process outstanding cudaEvents
  process_events();

  // Round the request up to a multiple of 512 bytes.
  size = round_size(size);
  // Pick the pool (small or large) appropriate for this size.
  auto& pool = get_pool(size);
  // The segment size that would actually be allocated from the driver.
  const size_t alloc_size = get_allocation_size(size);
  // Pack the request into params, most importantly size and alloc_size.
  AllocParams params(device, size, stream, &pool, alloc_size, stats);
  // Set the stat flags: stat_types has three slots (AGGREGATE, SMALL_POOL,
  // LARGE_POOL), each a bit in a bitset set to true or false.
  params.stat_types[static_cast<size_t>(StatType::AGGREGATE)] = true;
  params.stat_types[static_cast<size_t>(get_stat_type_for_pool(pool))] = true;

  // The core of the allocator: four attempts, tried in order.
  bool block_found =
    // 1. Search the pool for a cached block of the required size.
    get_free_block(params)
    // 2. Trigger callbacks (garbage-collect blocks nobody uses) and retry the search.
    || (trigger_free_memory_callbacks(params) && get_free_block(params))
    // 3. Nothing reusable among existing blocks: call cudaMalloc to create a new block.
    || alloc_block(params, false)
    // 4. Free all non-split cached blocks and retry the allocation.
    || (free_cached_blocks() && alloc_block(params, true));

  // If we reach this point with no block, the pointer could not be reused and
  // no extra memory could be allocated: handle the failure (report OOM).
  TORCH_INTERNAL_ASSERT((!block_found && params.err != cudaSuccess) || params.block);
  if (!block_found) {
    if (params.err == cudaErrorMemoryAllocation) {
      size_t device_free;
      size_t device_total;
      C10_CUDA_CHECK(cudaMemGetInfo(&device_free, &device_total));
      stats.num_ooms += 1;

      // "total capacity": total global memory on GPU
      // "already allocated": memory allocated by the program using the
      //                      caching allocator
      // "free": free memory as reported by the CUDA API
      // "cached": memory held by the allocator but not used by the program
      //
      // The allocated amount does not include memory allocated outside
      // of the caching allocator, such as memory allocated by other programs
      // or memory held by the driver.
      //
      // The sum of "allocated + free + cached" may be less than the
      // total capacity due to memory held by the driver and usage by other
      // programs.
      //
      // Note that at this point free_cached_blocks has already returned all
      // possible "cached" memory to the driver. The only remaining "cached"
      // memory is split from a larger block that is partially in-use.
      TORCH_CHECK_WITH(CUDAOutOfMemoryError, false,
        "CUDA out of memory. Tried to allocate ", format_size(alloc_size),  // the request that broke the camel's back
        " (GPU ", device, "; ",
        format_size(device_total), " total capacity; ",  // total GPU memory, from cudaMemGetInfo(&device_free, &device_total)
        format_size(stats.allocated_bytes[static_cast<size_t>(StatType::AGGREGATE)].current),
        " already allocated; ",                          // bytes handed out by the caching allocator
        format_size(device_free), " free; ",             // free memory as reported by CUDA
        format_size(stats.reserved_bytes[static_cast<size_t>(StatType::AGGREGATE)].current),
        " reserved in total by PyTorch)");               // reserved minus allocated_bytes = memory physically allocated but logically unused, parked in the pools
    } else {
      C10_CUDA_CHECK(params.err);
    }
  }

  Block* block = params.block;
  Block* remaining = nullptr;
  TORCH_INTERNAL_ASSERT(block);

  const bool already_split = block->is_split();
  // Block splitting, for both the get_free_block case (reusing a cached
  // pointer) and the alloc_block case (fresh allocation).
  if (should_split(block, size)) {
    remaining = block;

    // Create a new block of `size` bytes (not alloc_size: alloc_size is
    // larger than needed, which is exactly why we split).
    block = new Block(device, stream, size, &pool, block->ptr);
    // Splice the new block into the linked list, turning the original
    // block into the remainder that follows it.
    block->prev = remaining->prev;
    if (block->prev) {
      block->prev->next = block;
    }
    block->next = remaining;

    remaining->prev = block;
    remaining->ptr = static_cast<char*>(remaining->ptr) + size;
    // Shrink the remainder and return it to the pool.
    remaining->size -= size;
    pool.insert(remaining);

    if (already_split) {
      // An already-split inactive block is being shrunk by size bytes.
      update_stat_array(stats.inactive_split_bytes, -block->size, params.stat_types);
    } else {
      // A new split inactive block is being created from a previously unsplit block,
      // size remaining->size bytes.
      update_stat_array(stats.inactive_split_bytes, remaining->size, params.stat_types);
      update_stat_array(stats.inactive_split, 1, params.stat_types);
    }
  } else if (already_split) {
    // An already-split block is becoming active
    update_stat_array(stats.inactive_split_bytes, -block->size, params.stat_types);
    update_stat_array(stats.inactive_split, -1, params.stat_types);
  }

  block->allocated = true;
  // active_blocks holds the blocks currently in use; insert the new one.
  active_blocks.insert(block);

  c10::reportMemoryUsageToProfiler(block, block->size, c10::Device(c10::DeviceType::CUDA, device));

  // Track the allocation count, allocated bytes, active count, and active bytes.
  update_stat_array(stats.allocation, 1, params.stat_types);
  update_stat_array(stats.allocated_bytes, block->size, params.stat_types);
  update_stat_array(stats.active, 1, params.stat_types);
  update_stat_array(stats.active_bytes, block->size, params.stat_types);

  return block;
}
```

---
```cpp
std::mutex mutex;

// Allocated blocks, keyed by device pointer. This is how the caching
// allocator tracks the memory blocks it has handed out: an
// std::unordered_map from a void* key (the device pointer) to a Block*
// describing the allocation. std::unordered_map is hash-based, with
// average constant-time lookup, insertion, and deletion.
std::unordered_map<void*, Block*> allocated_blocks;

/**
 * Member function of THCCachingAllocator.
 * Adds a newly allocated memory block to the allocated_blocks hash map.
 *
 * Called by THCCachingAllocator::malloc.
 */
void add_allocated_block(Block* block) {
  std::lock_guard<std::mutex> lock(mutex);
  allocated_blocks[block->ptr] = block;
}
```
#### void* raw_alloc_with_stream(size_t nbytes, cudaStream_t stream);

```cpp
// Implementation
void* raw_alloc_with_stream(size_t nbytes, cudaStream_t stream) {
  if (nbytes == 0) {
    return nullptr;
  }
  int device;
  C10_CUDA_CHECK(cudaGetDevice(&device));
  void* r = nullptr;
  // The difference from void* raw_alloc(size_t nbytes) is that the
  // stream is specified by the caller.
  caching_allocator.malloc(&r, device, nbytes, stream);
  return r;
}
```
#### raw_delete(void* ptr);

```cpp
// Implementation of void raw_delete(void* ptr);
void raw_delete(void* ptr) {
  caching_allocator.free(ptr);
}
```

---

```cpp
/**
 * Member function of THCCachingAllocator.
 * Called by void raw_delete.
 */
void free(void* ptr) {
  if (!ptr) {
    return;
  }
  Block* block = get_allocated_block(ptr, true /* remove */);
  if (!block) {
    AT_ERROR("invalid device pointer: ", ptr);
  }
  device_allocator[block->device]->free(block);
}
```

---

```cpp
/**
 * Member function of THCCachingAllocator.
 * Called by void free.
 */
Block* get_allocated_block(void* ptr, bool remove = false) {
  std::lock_guard<std::mutex> lock(mutex);
  auto it = allocated_blocks.find(ptr);
  if (it == allocated_blocks.end()) {
    return nullptr;
  }
  Block* block = it->second;
  if (remove) {
    allocated_blocks.erase(it);
  }
  return block;
}
```

---
```cpp
/**
 * Member function of DeviceCachingAllocator.
 * Called by THCCachingAllocator::free.
 */
void free(Block* block)
{
  std::lock_guard<std::recursive_mutex> lock(mutex);

  block->allocated = false;

  c10::reportMemoryUsageToProfiler(block, -block->size, c10::Device(c10::DeviceType::CUDA, block->device));

  // Update the global statistics.
  StatTypes stat_types;
  stat_types[static_cast<size_t>(StatType::AGGREGATE)] = true;
  stat_types[static_cast<size_t>(get_stat_type_for_pool(*(block->pool)))] = true;
  update_stat_array(stats.allocation, -1, {stat_types});
  update_stat_array(stats.allocated_bytes, -block->size, {stat_types});

  // Is the block still in use by any stream?
  if (!block->stream_uses.empty()) {
    // Yes: record events and defer the free until those streams are done.
    insert_events(block);
  } else {
    // No: return the block to the pool immediately.
    free_block(block);
  }
}
```
#### void* getBaseAllocation(void *ptr, size_t *size);

```cpp
// Implementation of void* getBaseAllocation(void *ptr, size_t *size);
void* getBaseAllocation(void* ptr, size_t* size)
{
  return caching_allocator.getBaseAllocation(ptr, size);
}
```

---

```cpp
// Member function of THCCachingAllocator; called by void* getBaseAllocation.
void* getBaseAllocation(void* ptr, size_t* outSize)
{
  Block* block = get_allocated_block(ptr);
  if (!block) {
    AT_ERROR("invalid device pointer: ", ptr);
  }
  return device_allocator[block->device]->getBaseAllocation(block, outSize);
}
```

---

```cpp
/**
 * Member function of DeviceCachingAllocator.
 * Called by THCCachingAllocator::getBaseAllocation.
 */
void* getBaseAllocation(Block* block, size_t* outSize) {
  std::lock_guard<std::recursive_mutex> lock(mutex);
  while (block->prev) {  // walk back to the head of the segment
    block = block->prev;
  }
  void* basePtr = block->ptr;  // the segment's base pointer
  if (outSize) {
    size_t size = 0;
    while (block) {
      size += block->size;
      block = block->next;
    }
    *outSize = size;  // the total length of the segment
  }
  return basePtr;
}
```
To be continued...