- r"""
- This module exposes a TunableOp interface.
- Some operations, such as GEMMs, could be implemented using more than one library
- or more than one technique. For example, a GEMM could be implemented for CUDA or
- ROCm using either the blas or blasLt libraries. Further, ROCm's rocblas and
- hipblaslt libraries allow the user to query for all possible algorithms and then
- choose one. How does one know which implementation is the fastest and should be
- chosen? That's what TunableOp provides.
- Enabling TunableOp and Tuning Separately
- ========================================
- The TunableOp feature is enabled separately from enabling the tuning phase
- itself. Enabling TunableOp means that PyTorch will replace any standard
- operators with their Tunable implementations. Any call to a TunableOp first
- checks whether it has already been tuned for the given operator inputs. If so,
- it will immediately call the tuned operation; no further tuning will take place
- even when the tuning setting is enabled. Instead if no tuning result is found,
- and tuning is enabled, the TunableOp will benchmark every registered
- implementation of that operator for the given set of inputs and select the
- fastest.
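
As a minimal sketch, the two switches can be driven independently through the
Python API defined in this module; the assertions merely illustrate the
resulting state::

    import torch.cuda.tunable as tunable

    tunable.enable(True)          # use Tunable implementations of operators
    tunable.tuning_enable(False)  # but only replay previously tuned results

    assert tunable.is_enabled()
    assert not tunable.tuning_is_enabled()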

File Input and Output
=====================

The first time any TunableOp is invoked, the internal database of tuned
operations will be prepared by attempting to read the results from the given
file. The default filename is 'tunableop_results.csv'. To support tuning when
multiple GPUs are used across multiple processes, the GPU device ordinal is
automatically inserted into the filename to avoid multiple processes overwriting
the same file.

If tuning is enabled and new tunings are discovered during the course of your
workload, it will also write out to this same filename with all tunings, both
the ones it read in at startup as well as the new ones found at runtime. This
can be used, for example, to build up a tunings file across many workloads by
reusing the same file. The output file is automatically created when the
application terminates. This behavior can be controlled by the C++ and Python
APIs but not the environment variables.
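
A sketch of controlling the results file from Python using the functions
defined below; the filename is illustrative::

    import torch.cuda.tunable as tunable

    # insert_device_ordinal=True appends the device ordinal so that each
    # process in a one-process-per-GPU job writes to its own file.
    tunable.set_filename("my_tunings.csv", insert_device_ordinal=True)
    tunable.write_file_on_exit(True)  # final flush when the context is destroyed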

Assuming you specified a filename, you'll end up with a CSV file with contents
like so::

    Validator,PT_VERSION,2.2.0
    Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
    Validator,HIPBLASLT_VERSION,0.6.0-a9c5cc7
    Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
    GemmTunableOp_float_NT,nt_25088_4096_64,1219,1.262
    GemmTunableOp_float_NT,nt_4096_4096_64,1216,0.033

Note the "Validator" lines. If you change a library version, or ROCm version, or
PyTorch version, TunableOp will detect this and reject the tunings file because
the prior tunings are likely affected by other software changes.

The remaining lines are the tuned solutions for each TunableOp encountered
during your execution. Each line consists of 4 comma-separated fields: operator
name, operator parameters, solution name, and average execution time. The
execution time is an optional field. The CSV file can be edited, but with
caution. For example, the solution name (field 3) can be changed to "Default"
and it will fall back to the original PyTorch untuned implementation. Or, in the
case of ROCm's hipBLAS or hipBLASLt libraries, if you know the specific solution
index you can override the solution that TunableOp selected by replacing the
value. The operator name and parameters (fields 1 and 2) are internally named
and should not be modified. In the case of GemmTunableOp, field 1 indicates the
datatype and whether the inputs are transposed (T) or not (N) and field 2
indicates the M, N, K input shapes.
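
As a worked reading of the first result line above, field by field; the layout
interpretation follows the field descriptions in this section::

    GemmTunableOp_float_NT   # operator name: a float GEMM in "NT" layout
    nt_25088_4096_64         # parameters: layout plus M=25088, N=4096, K=64
    1219                     # solution name: library solution index 1219
    1.262                    # average execution time (optional field)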

There is an option to enable verbose output but it is only recommended for
debugging purposes. This will produce a lot of diagnostic messages but may be
useful to see if TunableOp is being used at all. Otherwise, TunableOp is
completely silent, besides file output, unless there is a warning or error
during its use. The verbose option is only available by setting the environment
variable PYTORCH_TUNABLEOP_VERBOSE=1.

A Note on Tuning Behavior
=========================

Tuning an operator consists of iterating through the list of registered
implementations and profiling each one. The profile is established by running a
single implementation in a loop multiple times and taking the average execution
time.

By default, each possible solution for a given operator will be run for either
100 iterations or as many iterations as can be run within 30ms, whichever is
smaller, and its average execution time will be calculated. The fastest solution
among all that were successfully profiled will be chosen. A profile might fail
if the given solution doesn't achieve the same accuracy as the default
implementation or if the solution returns an error code.
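
A sketch of tightening the default tuning budget through the functions defined
below; the values are illustrative::

    import torch.cuda.tunable as tunable

    tunable.set_max_tuning_iterations(50)  # at most 50 iterations per solution
    tunable.set_max_tuning_duration(10)    # or 10 ms, whichever is smaller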

Current Tunable Operators
=========================

TunableGemm for ROCm
--------------------

Currently only a TunableGemm for ROCm is implemented. Note that CUDA builds of
PyTorch will function correctly when using TunableOp but the only solution
available to CUDA builds is the 'Default' implementation, i.e., the original
cuBLAS default, now called through TunableOp. Any call to at::cuda::blas::gemm()
or ::bgemm() will be routed through TunableOp when enabled. Calling gemm() for a
given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.
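
As a sketch, any Python matmul that reaches at::cuda::blas::gemm() picks up
tuning automatically once the feature is enabled; ``torch.mm`` here stands in
for any such call and the shapes are illustrative::

    import torch
    import torch.cuda.tunable as tunable

    tunable.enable(True)
    tunable.tuning_enable(True)

    a = torch.randn(2048, 1024, device="cuda")
    b = torch.randn(1024, 512, device="cuda")
    # The first call for this (transa, transb, m, n, k) combination benchmarks
    # the registered solutions; later calls reuse the recorded winner.
    c = torch.mm(a, b)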

Tuning Context
==============

The behavior of TunableOp is currently manipulated through environment
variables, the C++ interface of at::cuda::tunable::getTuningContext(), or the
torch.cuda.tunable python interfaces that wrap the C++ TuningContext. The
environment variables take precedence over any setting you manipulate using the
C++ or Python APIs.
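
A sketch of the precedence rule; PYTORCH_TUNABLEOP_ENABLED is the feature's
on/off environment variable in PyTorch's TunableOp, though you should verify
the name against your build::

    # Launched as: PYTORCH_TUNABLEOP_ENABLED=0 python my_script.py
    import torch.cuda.tunable as tunable

    tunable.enable(True)         # requested via the Python API...
    print(tunable.is_enabled())  # ...but the environment variable wins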
- """

from typing import Optional, Tuple

import torch

__all__ = [
    "enable",
    "is_enabled",
    "tuning_enable",
    "tuning_is_enabled",
    "set_max_tuning_duration",
    "get_max_tuning_duration",
    "set_max_tuning_iterations",
    "get_max_tuning_iterations",
    "set_filename",
    "get_filename",
    "get_results",
    "get_validators",
    "write_file_on_exit",
    "write_file",
    "read_file",
]


def enable(val: bool = True) -> None:
    r"""This is the big on/off switch for all TunableOp implementations."""
    torch._C._cuda_tunableop_enable(val)  # type: ignore[attr-defined]


def is_enabled() -> bool:
    r"""Returns whether the TunableOp feature is enabled."""
    return torch._C._cuda_tunableop_is_enabled()  # type: ignore[attr-defined]


def tuning_enable(val: bool = True) -> None:
    r"""Enable tuning of TunableOp implementations.

    When enabled, if a tuned entry isn't found, run the tuning step and record
    the entry.
    """
    torch._C._cuda_tunableop_tuning_enable(val)  # type: ignore[attr-defined]


def tuning_is_enabled() -> bool:
    r"""Returns whether TunableOp implementations can be tuned."""
    return torch._C._cuda_tunableop_tuning_is_enabled()  # type: ignore[attr-defined]


def set_max_tuning_duration(duration: int) -> None:
    r"""Set max time in milliseconds to spend tuning a given solution.

    If both max tuning duration and iterations are set, the smaller of the two
    will be honored. At minimum 1 tuning iteration will always be run.
    """
    torch._C._cuda_tunableop_set_max_tuning_duration(duration)  # type: ignore[attr-defined]


def get_max_tuning_duration() -> int:
    r"""Get max time to spend tuning a given solution."""
    return torch._C._cuda_tunableop_get_max_tuning_duration()  # type: ignore[attr-defined]


def set_max_tuning_iterations(iterations: int) -> None:
    r"""Set max number of iterations to spend tuning a given solution.

    If both max tuning duration and iterations are set, the smaller of the two
    will be honored. At minimum 1 tuning iteration will always be run.
    """
    torch._C._cuda_tunableop_set_max_tuning_iterations(iterations)  # type: ignore[attr-defined]


def get_max_tuning_iterations() -> int:
    r"""Get max iterations to spend tuning a given solution."""
    return torch._C._cuda_tunableop_get_max_tuning_iterations()  # type: ignore[attr-defined]


def set_filename(filename: str, insert_device_ordinal: bool = False) -> None:
    r"""Set the filename to use for input/output of tuning results.

    If :attr:`insert_device_ordinal` is ``True`` then the current device ordinal
    will be added to the given filename automatically. This can be used in a
    1-process-per-gpu scenario to ensure all processes write to a separate file.
    """
    torch._C._cuda_tunableop_set_filename(filename, insert_device_ordinal)  # type: ignore[attr-defined]


def get_filename() -> str:
    r"""Get the results filename."""
    return torch._C._cuda_tunableop_get_filename()  # type: ignore[attr-defined]


def get_results() -> Tuple[str, str, str, float]:
    r"""Return all TunableOp results."""
    return torch._C._cuda_tunableop_get_results()  # type: ignore[attr-defined]


def get_validators() -> Tuple[str, str]:
    r"""Return the TunableOp validators."""
    return torch._C._cuda_tunableop_get_validators()  # type: ignore[attr-defined]


def write_file_on_exit(val: bool) -> None:
    r"""During Tuning Context destruction, write file to disk.

    This is useful as a final flush of your results to disk if your application
    terminates as a result of normal operation or an error. Manual flushing of
    your results can be achieved by calling ``write_file()``.
    """
    torch._C._cuda_tunableop_write_file_on_exit(val)  # type: ignore[attr-defined]


def write_file(filename: Optional[str] = None) -> bool:
    r"""Write results to a CSV file.

    If :attr:`filename` is not given, ``get_filename()`` is called.
    """
    if filename is None:
        filename = get_filename()
    return torch._C._cuda_tunableop_write_file(filename)  # type: ignore[attr-defined]


def read_file(filename: Optional[str] = None) -> bool:
    r"""Read results from a TunableOp CSV file.

    If :attr:`filename` is not given, ``get_filename()`` is called.
    """
    if filename is None:
        filename = get_filename()
    return torch._C._cuda_tunableop_read_file(filename)  # type: ignore[attr-defined]