Benchmark (벤치마크)

IT and Life 2008. 10. 30. 10:41

Dictionary definition

benchmark /bɛntʃmɑrk/

- noun
1. a standard of excellence, achievement, etc., against which similar things must be measured or judged: The new hotel is a benchmark in opulence and comfort.

2. any standard or reference by which others can be measured or judged: The current price for crude oil may become the benchmark.

3. Computers. an established point of reference against which computers or programs can be measured in tests comparing their performance, reliability, etc.
- adjective
4. of, pertaining to, or resulting in a benchmark: benchmark test, benchmark study.

Introduction

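In computing, a benchmark is the act of running a computer program, a set of programs, or other operations in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. The term "benchmark" is also commonly used for the elaborately designed benchmarking programs themselves. Benchmarking is usually associated with assessing the performance characteristics of computer hardware, for example the floating-point performance of a CPU, but there are circumstances in which the technique also applies to software. Software benchmarks are run, for example, against compilers or database management systems. Another type of test program, known as a test suite or validation suite, is intended to assess the correctness of software.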

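Benchmarks provide a method of comparing the performance of various subsystems across different chip and system architectures. Benchmarking is also helpful in understanding how a database manager responds under varying conditions. You can create scenarios that test deadlock handling, utility performance, different methods of loading data, transaction rate characteristics as more users are added, and even the effect on an application of using a new release of the product.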

Purpose

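As computer architecture advanced, it became more difficult to compare the performance of various computer systems simply by looking at their specifications. Tests were therefore developed that allow different architectures to be compared. For example, Pentium 4 processors generally operate at a higher clock frequency than Athlon XP processors, but that does not necessarily translate into more computational power. A processor with a lower clock frequency can perform as well as one operating at a higher frequency. See BogoMips and the megahertz myth.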
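The point above, that a higher clock frequency does not by itself mean more computational power, can be illustrated with the standard relation: execution time = instruction count × cycles per instruction (CPI) / clock rate. The sketch below is only an illustration; the two processors and every figure in it (instruction count, CPI, clock rates) are hypothetical assumptions, not measurements of a Pentium 4 or an Athlon XP.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds needed to run a program: instructions * CPI / clock rate."""
    return instructions * cpi / clock_hz


# Hypothetical workload and processors (all figures are made up for illustration).
INSTRUCTIONS = 1e9
cpu_high_clock = {"clock_hz": 3.0e9, "cpi": 1.5}  # faster clock, more cycles per instruction
cpu_low_clock = {"clock_hz": 2.0e9, "cpi": 1.0}   # slower clock, fewer cycles per instruction

t_high = execution_time(INSTRUCTIONS, cpu_high_clock["cpi"], cpu_high_clock["clock_hz"])
t_low = execution_time(INSTRUCTIONS, cpu_low_clock["cpi"], cpu_low_clock["clock_hz"])

print(f"high-clock CPU: {t_high:.3f} s")
print(f"low-clock CPU:  {t_low:.3f} s")
```

With these assumed numbers both processors need the same time, even though one runs at a 50% higher clock. A benchmark measures the elapsed time directly, so it does not depend on guessing the CPI from a specification sheet.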

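Benchmarks are designed to mimic a particular type of workload on a component or system. Synthetic benchmarks do this with specially created programs that impose the workload on the component. Application benchmarks run real-world programs on the system. While application benchmarks usually give a much better measure of real-world performance on a given system, synthetic benchmarks are useful for testing individual components, such as a hard disk or a networking device.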
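The distinction between the two kinds of workload can be sketched with a tiny timing harness built on Python's standard `timeit` module. This is a minimal sketch, not a real benchmark: `synthetic_workload`, `application_workload`, and `best_time` are hypothetical names introduced here, and sorting a list is only a stand-in for a real application.

```python
import timeit


def synthetic_workload(n: int = 10_000) -> float:
    """Artificial floating-point loop that exists only to load the CPU."""
    total = 0.0
    for i in range(1, n + 1):
        total += (i * 0.5) / (i + 1.0)
    return total


def application_workload(n: int = 10_000) -> list:
    """Stand-in for a real task: sort a reversed list of integers."""
    return sorted(range(n, 0, -1))


def best_time(func, repeat: int = 5, number: int = 10) -> float:
    """Best per-call time in seconds over several repeated runs."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


print(f"synthetic:   {best_time(synthetic_workload):.6f} s per call")
print(f"application: {best_time(application_workload):.6f} s per call")
```

Taking the minimum over several repeats is one common choice for reducing noise from other processes; reporting the full distribution is an equally reasonable design when variability itself matters.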

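Benchmarks are particularly important in CPU design, because they give processor architects the ability to measure and make trade-offs in microarchitectural decisions. For example, if a benchmark extracts the key algorithms of an application, it will contain the performance-sensitive aspects of that application. Running this much smaller snippet on a cycle-accurate simulator can give clues on how to improve performance.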

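Prior to 2000, computer and microprocessor architects used SPEC to do this, although SPEC's Unix-based benchmarks were quite lengthy and therefore unwieldy to use. Computer manufacturers are known to configure their systems to give unrealistically high performance on benchmark tests, performance that is not replicated in real usage. For instance, during the 1980s some compilers could detect a specific mathematical operation used in a well-known floating-point benchmark and replace it with a faster, mathematically equivalent operation. However, such a transformation was rarely useful outside the benchmark until the mid-1990s, when RISC and VLIW architectures emphasized the importance of compiler technology for performance. Compiler companies now use benchmarks regularly to improve not only their own benchmark scores but also real application performance.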

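CPUs that have many execution units, such as a superscalar CPU, a VLIW CPU, or a reconfigurable computing CPU, typically have slower clock rates than a sequential CPU with one or two execution units when built from transistors that are just as fast. Nevertheless, CPUs with many execution units often complete real-world and benchmark tasks in less time than the supposedly faster high-clock-rate CPU.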

Challenges

Benchmarking is not easy and often involves several iterative rounds in order to arrive at predictable, useful conclusions. Interpretation of benchmarking data is also extraordinarily difficult. Here is a partial list of common challenges:

- Vendors tend to tune their products specifically for industry-standard benchmarks. Norton SysInfo (SI) is particularly easy to tune for, since it is mainly biased toward the speed of multiple operations. Use extreme caution in interpreting such results.

- Many benchmarks focus entirely on the speed of computational performance, neglecting other important features of a computer system, such as:
○ Qualities of service. Benchmarks generally do not give any credit for qualities of service aside from raw performance. Examples of unmeasured qualities of service include security, availability, reliability, execution integrity, serviceability, and scalability (especially the ability to quickly and nondisruptively add or reallocate capacity). There are often real trade-offs between and among these qualities of service, and all are important in business computing. TPC Benchmark specifications partially address these concerns by specifying ACID property tests, database scalability rules, and service level requirements.
○ Total cost of ownership (TCO). In general, benchmarks do not measure TCO. TPC Benchmark specifications partially address this concern by specifying that a metric based on a simplified TCO formula must be reported in addition to the raw performance metric.
○ Electrical power. When more power is used, a portable system will have a shorter battery life and require recharging more often. This is often the antithesis of performance, as most semiconductors require more power to switch faster. See also performance per watt.
○ Code density. In some embedded systems, where memory is a significant cost, better code density can significantly reduce costs.

- Benchmarks seldom measure real-world performance of mixed workloads, that is, running multiple applications concurrently in a full, multi-department or multi-application business context. For example, IBM's mainframe servers (System z9) excel at mixed workloads, but industry-standard benchmarks do not tend to measure the strong I/O and the large, fast memory design such servers require. (Most other server architectures dictate fixed-function (single-purpose) deployments, e.g. "database servers," "Web application servers," and "file servers," and measure only that. The better question is, "What more computing infrastructure would I need to fully support all this extra workload?")

- Vendor benchmarks tend to ignore requirements for development, test, and disaster recovery computing capacity. Vendors only like to report what might be narrowly required for production capacity in order to make their initial acquisition price seem as low as possible.

- Benchmarks are having trouble adapting to widely distributed servers, particularly those with extra sensitivity to network topologies. The emergence of grid computing, in particular, complicates benchmarking, since some workloads are "grid friendly" while others are not.

- Users can have very different perceptions of performance than benchmarks may suggest. In particular, users appreciate predictability: servers that always meet or exceed service level agreements. Benchmarks tend to emphasize mean scores (IT perspective) rather than low standard deviations (user perspective).

- Many server architectures degrade dramatically ("fall off a cliff") at high levels of usage (near 100%), and benchmarks should (but often do not) take that factor into account. Vendors, in particular, tend to publish server benchmarks at a continuous load of about 80% usage, which is an unrealistic situation, and do not document what happens to the overall system when demand spikes beyond that level.

- Benchmarking institutions often disregard or do not follow basic scientific method. This includes, but is not limited to: small sample size, lack of variable control, and the limited repeatability of results. [1]
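The item in the list above about mean scores versus standard deviations can be made concrete with a small sketch using Python's standard `statistics` module. The response times and the `SLA_MS` threshold below are invented for illustration; they are not taken from any real benchmark.

```python
import statistics

# Hypothetical response times in milliseconds for two servers (made-up data).
server_a = [100, 101, 99, 100, 100, 102, 98, 100, 100, 100]
server_b = [60, 60, 60, 60, 60, 60, 60, 60, 60, 460]

SLA_MS = 150  # hypothetical service level threshold


def summarize(samples):
    """Return the mean, sample standard deviation, and worst case."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
        "max": max(samples),
    }


for name, samples in (("A", server_a), ("B", server_b)):
    summary = summarize(samples)
    violations = sum(1 for x in samples if x > SLA_MS)
    print(
        f"server {name}: mean={summary['mean']} ms, "
        f"stdev={summary['stdev']:.1f} ms, max={summary['max']} ms, "
        f"SLA violations={violations}"
    )
```

Both hypothetical servers have the same mean, so a benchmark that reports only the mean cannot tell them apart, while the standard deviation and the count of requests over the threshold show that only server A is predictable.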

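Types of benchmarks

1. Real program
- word processing software
- tool software of CDA
- user's application software (MIS)

2. Kernel
- contains key codes
- normally abstracted from an actual program
- popular kernel: Livermore loops
- Linpack benchmark (contains basic linear algebra subroutines written in FORTRAN)
- results are expressed in MFLOPS

3. Component benchmark / micro-benchmark
- programs designed to measure the performance of a computer's basic components
- automatic detection of a computer's hardware parameters, such as number of registers, cache size, and memory latency

4. Synthetic benchmark
- Procedure for programming a synthetic benchmark:
○ take statistics of all types of operations from many application programs
○ get the proportion of each operation
○ write a program based on those proportions

- Types of synthetic benchmark:
○ Whetstone
○ Dhrystone

- These were the first general-purpose industry-standard computer benchmarks. They do not necessarily obtain high scores on modern pipelined computers.

5. I/O benchmarks

6. Parallel benchmarks: used on machines with multiple processors or on systems consisting of multiple machines.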
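The three-step procedure for programming a synthetic benchmark, listed under item 4 above, can be sketched as follows. This is a toy illustration under stated assumptions: the operation counts in `observed_ops` are hypothetical (real ones would come from profiling application programs), and the entries in `OPERATIONS` are tiny stand-ins rather than the operations used by Whetstone or Dhrystone.

```python
import random
import time
from collections import Counter

# Step 1: operation statistics gathered from application programs.
# These counts are hypothetical and only illustrate the procedure.
observed_ops = Counter({"int_add": 5000, "float_mul": 3000, "mem_copy": 2000})


# Step 2: proportion of each operation type.
def proportions(counts):
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


# Tiny stand-ins for each operation type (assumptions for illustration).
OPERATIONS = {
    "int_add": lambda: 12345 + 67890,
    "float_mul": lambda: 1.2345 * 6.789,
    "mem_copy": lambda: bytes(bytearray(256)),
}


# Step 3: a program that executes operations in the measured mix.
def run_synthetic(mix, iterations=10_000, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the operation sequence repeatable
    ops = list(mix)
    weights = [mix[op] for op in ops]
    executed = Counter()
    start = time.perf_counter()
    for op in rng.choices(ops, weights=weights, k=iterations):
        OPERATIONS[op]()
        executed[op] += 1
    elapsed = time.perf_counter() - start
    return executed, elapsed


mix = proportions(observed_ops)
executed, elapsed = run_synthetic(mix)
print("target mix:", mix)
print("executed:  ", dict(executed))
print(f"elapsed:    {elapsed:.4f} s")
```

Drawing operations at random with the measured weights is one possible design; a fixed, hand-written sequence in the same proportions would serve the same purpose and avoids the overhead of the random draw.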

Common benchmarks

Industry Standard (audited and verifiable)

Open source benchmarks

DEISA Benchmark Suite: scientific HPC applications benchmark
Dhrystone: integer arithmetic performance
Fhourstones: an integer benchmark
HINT: It ranks a computer system as a whole.
Iometer: I/O subsystem measurement and characterization tool for single and clustered systems.
Linpack / LAPACK
NAS parallel benchmarks
PAL: a benchmark for realtime physics engines
POV-Ray: 3D render
TPoX: An XML transaction processing benchmark for XML databases
VMmark: a server virtualization benchmark suite from VMware.
Whetstone: floating-point arithmetic performance
Phoronix Test Suite: open-source benchmarking suite for Linux, Solaris, and FreeBSD

Microsoft Windows benchmarks

BAPCo: MobileMark, SYSmark, WebMark
Futuremark: 3DMark, PCMark
Whetstone
PiFast
Super PI
WinSAT, exclusively for Windows Vista, providing an index for consumers to rate their systems easily

Others

BRL-CAD
Khornerstone
iCOMP, the Intel comparative microprocessor performance, published by Intel
Performance Rating, modelling scheme used by AMD and Cyrix to reflect the relative performance usually compared to competing products

Source: http://en.wikipedia.org/wiki/Benchmark_(computing)
Translation: the author (mryou.tistory.com)

WRITTEN BY

: 하이런