AES
Timings of the best known implementations

Best timings for internal ciphering routine (128-bit blocks)
Machine | Mars | RC6 | Rijndael | Serpent | Twofish | Cast 256 | Crypton | DEAL | DFC | E2 | Frog | HPC | Loki 97 | Magenta | Safer +
Intel x86 (all use 32-bit words)
Intel 486DX2 50 MHz | 199KB/s 3925c (g) | 292KB/s 2680c (b) | 330KB/s 2370c (b) | 60KB/s 12900c (b) | 281KB/s 2785c (b) | 254KB/s 3075c (g) | 139KB/s 5615c (b) | 94KB/s 8315c (b) | 306KB/s 2550c (1d) | 215KB/s 3630c (b) | 154KB/s 5075c (b) | 88KB/s 8855c (b) | 33KB/s 23420c (b) | 26KB/s 30105c (b) | 76KB/s 10300c (b)
Cyrix 6x86MX 166 MHz | 3.58MB/s 707c (b) | 3.59MB/s 706c (b) | 2.99MB/s 847c (b) | 1.24MB/s 2042c (b) | 3.69MB/s 687c (b) | 2.42MB/s 1046c (b) | 2.68MB/s 946c (b) | 0.63MB/s 4017c (b) | 2.74MB/s 924c (1d) | 2.63MB/s 963c (b) | 2.24MB/s 1129c (b) | 1.30MB/s 1954c (*g) | 0.67MB/s 3785c (b) | 0.20MB/s 12616c (b) | 1.28MB/s 1975c (b)
Intel Pentium 90 MHz | 1.81MB/s 760c (a) 550c (asm) | 1.81MB/s 758c (a) 700c (asm) | 4.29MB/s 320c (2j) | 1.07MB/s 1279c (a) 1100c (asm) | 4.74MB/s 290c (2m) | 1.21MB/s 1134c (b) 600c (asm) | 1.69MB/s 811c (2b) 390c (asm) | 0.29MB/s 4788c (b) 2200c (asm) | 2.25MB/s 609c (2d) | 1.27MB/s 1079c (a) 410c (asm) | 0.68MB/s 2007c (b) | 0.56MB/s 2439c (*g) | 0.29MB/s 4698c (b) | 0.11MB/s 12708c (b) | 0.56MB/s 2449c (a) 1100c (asm)
Pentium II ¹ 200 MHz | 9.97MB/s 306c (3h) | 13.68MB/s 223c (3i) | 12.88MB/s 237c (3j) | 3.21MB/s 952c (a) 900c (asm) | 11.82MB/s 258c (3m) | 4.82MB/s 633c (a) 600c (asm) | 7.92MB/s 385c (3b) 345c (3b) | 1.30MB/s 2339c (a) 2200c (asm) | 7.77MB/s 393c (3d) | 8.60MB/s 355c (3e) | 1.47MB/s 2080c (b) | 2.14MB/s 1429c (a) | 1.43MB/s 2134c (a) | 0.47MB/s 6539c (a) | 4.12MB/s 740c (3k)
Sparc (all use 32-bit words; UltraSparc can use 64-bit words)
TurboSparc 170 MHz | 2.42MB/s 1071c (c) | 2.99MB/s 867c (c) | 2.93MB/s 884c (c) | 1.45MB/s 1785c (c) | 3.05MB/s 850c (c) | 1.91MB/s 1356c (h) | 2.63MB/s 986c (c) | 0.59MB/s 4430c (h) | 3.09MB/s 840c (4d) | 2.21MB/s 1173c (c) | 1.03MB/s 2516c (c) | 1.25MB/s 2074c (c) | 0.72MB/s 3604c (c) | 0.22MB/s 11713c (c) | 0.69MB/s 3740c (c)
SuperSparc 50 MHz | 1.35MB/s 565c (c) | 1.59MB/s 480c (c) | 1.73MB/s 440c (c) | 0.90MB/s 845c (c) | 1.73MB/s 440c (c) | 1.11MB/s 690c (c) | 1.50MB/s 510c (c) | 0.34MB/s 2263c (h) | 1.27MB/s 600c (5d) | 1.35MB/s 565c (c) | 0.64MB/s 1195c (c) | 0.74MB/s 1030c (c) | 0.46MB/s 1675c (c) | 0.13MB/s 5920c (c) | 0.37MB/s 2040c (c)
UltraSparcIIi 270 MHz | 5.09MB/s 810c (c) | 3.54MB/s 1164c (c) | 12.37MB/s 333c (6j) | 4.21MB/s 979c (c) | 10.99MB/s 375c (6m) | 6.17MB/s 667c (c) | 9.60MB/s 429c (c) | 1.79MB/s 2306c (h) | 5.32MB/s 775c (6d) | 6.20MB/s 664c (c) | 1.82MB/s 2268c (b) | 9.16MB/s 450c (*g) | 2.31MB/s 1782c (c) | 0.60MB/s 6858c (c) | 1.59MB/s 2592c (b)
Dec Alpha (all use 64-bit words)
Alpha EV45 | 1303c (c) | 1247c (c) | 907c (c) | 1785c (c) | 850c (c) | 1275c (c) | 878c (c) | 5241c (c) | 512c (7d) | 1247c (c) | 5780c (c) | 662c (*g) | 6970c (c) | 22722c (c) | 10710c (c)
Alpha EV56 | 507c (7h) 478c (KA) | 559c (c) 467c (KA) | 439c (d) 340c (KA) | 984c (d) 915c (KA) | 442c (d) 360c (KA) | 749c (c) 600c (KA) | 499c (c) 408c (KA) | 2752c (c) 2528c (KA) | 312c (7d) 304c (KA) | 587c (7e) 471c (KA) | 2752c (c) | 402c (7g) 380c (KA) | 2356c (c) | 5074c (c) | 1502c (c) 656c (KA)
Alpha EV6 ² | 450c (c) 375c (RW) | 382c (c) 360c (RW) | 285c (c) 210c (RW) | 854c (d) 570c (RW) | 315c (c) 255c (RW) | 615c (c) | 353c (c) | 2010c (c) | 232c (7d) | 510c (c) | 3750c (c) | 420c (*g) | 1132c (c) | 3600c (c) | 929c (c)
PA-RISC (PA7000 uses 32-bit words; PA8200 and Merced/McKinley use 64-bit words)
HP PA7000 | 950c (c) | 1085c (c) | 735c (c) | 1345c (c) | 755c (c) | 1275c (c) | 865c (c) | 3940c (c) | 1628c (c) | 990c (c) | 2620c (c) | 1315c (*g) | 4685c (c) | 10315c (c) | 5085c (c)
HP PA8500 538c (e) 493c (e) 168c (e) 580c (e) 200c (e)
Itanium (Merced) 511c (e) 490c (e) 125c (e) 565c (e) 182c (e) 240c (TM)
McKinley 525c (DW) 142c (DW) 720c (DW) 181c (DW)
Other: Power and ARM use 32-bit words; 8051 and 6805 (smart card CPUs) use 8-bit words. The 8051 uses 256 bytes of RAM and the MC68HC05SC41 variant of the 6805 uses 128 bytes of RAM.
PowerPC ³ 300c (9h) 590c (9m)
ARM 790c (10i) 1467c (10j) 8406c (10m) 442c (10d) 2180c (10e)
8051 14500c (11i) 3168c (11j) 26147c (11e)
6805 358 Kc (12h) 106 Kc (12i) 9.5 Kc (12j) 126 Kc (12l) 26 Kc (12m) 31 Kc (12b)
The timings are measured for 128-bit keys and 128-bit messages. Each cell gives the throughput and the number of cycles per block for the internal ciphering function. The measurement is made with successive encryptions of the same memory location. If decryption is faster, its timing is given instead. These numbers measure the best ciphering rate achievable between two similar machines with a fixed key and a small plaintext held in the cache. The fastest cipher in each row is shown on a yellow background; ciphers less than twice as slow as the fastest one are shown on a white background.
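The two figures in each cell are tied together by the clock frequency. The sketch below shows the conversion for a 16-byte block; the function names are illustrative, and the convention 1 KB = 1024 bytes, 1 MB = 1024×1024 bytes is an assumption inferred from the table entries, not something the page states.

```python
BLOCK_BYTES = 16   # one 128-bit block
KB = 1024          # assumed unit, inferred from the table entries
MB = 1024 * 1024   # assumed unit, inferred from the table entries


def bytes_per_second(clock_hz, cycles_per_block, block_bytes=BLOCK_BYTES):
    """Throughput when one block is ciphered every `cycles_per_block` cycles."""
    return clock_hz * block_bytes / cycles_per_block


def cycles_per_block(clock_hz, rate_bytes_per_s, block_bytes=BLOCK_BYTES):
    """Inverse conversion: cycles spent per block at a given throughput."""
    return clock_hz * block_bytes / rate_bytes_per_s


# Mars on the 486DX2 at 50 MHz: 3925 cycles per block
print(round(bytes_per_second(50e6, 3925) / KB))      # 199 (KB/s)
# Mars on the Pentium II at 200 MHz: 306 cycles per block
print(round(bytes_per_second(200e6, 306) / MB, 2))   # 9.97 (MB/s)
```

Both results agree with the corresponding table entries, which is what suggests the binary units.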
Sometimes, surprisingly, within the same family of microprocessors, the older CPU is faster (in cycles, with the same code) than the newer one.
 ¹
For C code, timings on Pentium Pro should be only slightly slower than on Pentium II, because compilers do not distinguish between the two and only the cache size makes a difference. However, MMX technology can speed up asm implementations by up to 26% (cf. the Aoki and Lipmaa article for AES3).
 ²
This is the new Alpha 21264. The timings were done on a pre-production test machine and may not be accurate for commercial models.
 ³
You can compare the cycle counts on a PPC 604e and a PPC 750, since the latter is roughly a 604e with an L2 cache.
C/asm
Counterpane's performance comparison of the AES submissions: an estimate of the performance of assembly or C code on Pentium or Pentium Pro. Because it is an estimate, it is shown in maroon.
KA
Kenneth Almquist's estimates of speeds on Alpha EV56. Because they are estimates, they are shown in maroon.
DW
Doug Whiting's estimates of speeds on Merced/McKinley (the first and second generation Intel IA64 processors). Because they are estimates (precision: 10%), they are shown in maroon.
TM
Terje Mathisen's estimates of speeds on IA64. Because they are estimates, they are shown in maroon.
RW
Richard Weiss' estimates of speeds for assembly on EV6, presented at AES3. Because they are estimates, they are shown in maroon.
(a)
Brian Gladman, using Microsoft Visual C++ version 6.
(b)
Brian Gladman's code using gcc version 2.8.1.
(c)
Brian Gladman's code using the native compiler (Sun cc version SWC-5.0 / Dec cc V5.8 / cc HP-UX 9).
(d)
Brian Gladman's code, slightly modified by Richard Weiss, using Dec cc V6.1 on Tru64.
(e)
Worley et al. timings presented at the AES3 conference.
(g)
OptCCode using gcc version 2.8.1.
(h)
OptCCode using Sun cc version SWC-5.0.
Mars
(3h)
H. Lipmaa assembly on Pentium II (AES3 paper).
(7h)
Previous timings, done with a binary compiled with Dec cc V5.8 on a pre-production EV6, were 507 cycles on EV56. The newer V6.x compilers are far less efficient: Richard Weiss reports 701 cycles.
(9h)
C implementation on PPC 604e.
(12h)
Geoffrey Keating implementation on 6805 cpu core.
RC6
(3i)
K. Aoki and H. Lipmaa assembly on Pentium II (AES3 paper).
(10i)
UCL/Crypto implementation on ARM-based smart card.
(11i)
UCL/Crypto implementation on 8051-based smart card.
(12i)
Geoffrey Keating implementation on 6805 cpu core. Decryption is 5 times slower.
Rijndael
(2j)
Assembly on Pentium.
(3j)
K. Aoki and H. Lipmaa assembly on Pentium II (AES3 paper).
(6j)
Helger Lipmaa's C implementation for UltraSparc.
(10j)
UCL/Crypto implementation on ARM-based smart card, including key schedule.
(11j)
UCL/Crypto implementation on 8051-based smart card, including key schedule.
(12j)
Geoffrey Keating implementation on 6805 cpu core. Decryption is 50% slower.
Serpent
(12l)
Geoffrey Keating "bitslice" implementation on 6805 cpu core.
Note that Serpent can have a bitslice implementation on CPUs of up to 32 bits that encrypts 128-bit blocks without mixing.
Twofish
(2m)
Author's assembly on Pentium.
(3m)
Author's assembly on Pentium II. Self modifying code.
(6m)
Helger Lipmaa's C implementation for UltraSparc.
(9m)
C implementation on PPC 750 (aka. G3).
(10m)
UCL/Crypto implementation on ARM-based smart card, including key schedule.
(12m)
Author's implementation on 6805 cpu core, quoted by Geoffrey Keating.
Crypton
(2b)
Author's C code using gcc, Crypton v1.0.
(3b)
Author's assembly on Pentium Pro, Crypton v1.0.
Helger Lipmaa estimates that his Pentium II implementation of Rijndael would translate into a 345-cycle implementation of Crypton.
(12b)
Geoffrey Keating implementation on 6805 cpu core.
DFC
(1d)
Philippe Hoogvorst made this assembly implementation optimised for i486, distributed in AddCode.
(2d)
Robert Harley modified the PPro assembly implementation to gain 136 cycles on Pentium.
(3d)
Dominik Behr, Robert Harley, Danjel McGougan and Terje Mathisen made this fast assembly implementation for Pentium Pro. This implementation can be made totally branchless and data-independent at a cost of 40 cycles (protection against timing attacks).
There is a faster implementation (387 cycles on Pentium Pro), but that one can only run with Windows, since it has to be the only thread running.
(4d)
AddCode C/assembly for sparc using Sun cc version SWC-5.0.
(5d)
Fabrice Noilhan's new C code with -DPPRO using Sun cc version SWC-5.0.
TurboSparc and SuperSparc are better CPUs than UltraSparc for DFC because the 32 x 32 -> 64 bits multiplication takes fewer cycles.
(6d)
Robert Harley's code using floating point multiplication.
The best C code with integer multiplication runs in 875 cycles.
(7d)
Robert Harley's assembly for Alpha under Linux. This code is an optimisation of the previous one (C plus one opcode macro) that runs in 323 cycles on EV56.
(10d)
Robert Harley and David Seal's assembly for StrongARM.
E2
(3e)
NTT assembly (reported in AES2).
(7e)
NTT assembly on some 600MHz EV56 under Digital Unix 4.0.
(10e)
UCL/Crypto implementation on ARM-based smart card.
(11e)
UCL/Crypto implementation on 8051-based smart card: there is not enough RAM, so it must use non-secure external RAM.
HPC
(*g)
Brian Gladman's C code, modified by me to use 64-bit integers, with gcc 2.8.1 or SWC-5.0 on UltraSparc V9 64-bit architecture, or Dec cc V5.8, or HP-UX cc.
(7g)
Timings done by the author, on some 300 MHz Alpha.
Safer +
(3k)
Lily Chen (Cylink team) announced on the AES forum an ANSI C implementation running at 33 Mbits/sec on Pentium Pro 200, with the Borland compiler.
Best timings for (bitslice or other) internal ciphering routine (1-kilobyte blocks)
Machine | Mars | RC6 | Rijndael | Serpent | Twofish | Cast 256 | Crypton | DEAL | DFC | E2 | Frog | HPC | Loki 97 | Magenta | Safer +
EV56 [64 bits] | 31.7Kc (*) | 34.9Kc (*) | 30.6Kc (*) | 62.4Kc (*) | 30.6Kc (*) | 46.8Kc (*) | 31.2Kc (*) | 172Kc (*) | 19.5Kc (*) | 36.7Kc (*) | 172Kc (*) | 9.5Kc (g) | 147Kc (*) | 317Kc (*) | 93.9Kc (*)
The timings are measured for 128-bit keys and 8192-bit messages. We give the number of thousands of cycles per 8192-bit block for the internal ciphering function.
(*)
This is just the timing of 64 successive 128-bit encryptions.
(g)
The author made an optimized implementation of a version of HPC with a 512-bit block size. This is the timing of 16 successive 512-bit encryptions. We don't know the time needed to convert sixteen 512-bit encrypted blocks to sixty-four 128-bit encrypted blocks, but it should be about the time of sixteen 512-bit decryptions and sixty-four 128-bit encryptions, i.e. about 178 Kcycles.
bitslice
The bitslice technique encrypts k blocks of 128 bits in parallel, where k is the word length of the CPU. Since most CPUs have a word length of at most 64 bits, we measured timings for messages of 64×128 = 8192 bits. We don't include the time for mixing the blocks before encryption or for de-mixing them after encryption.
To be compatible with non-bitslice implementations, we need to mix the k blocks before running the bitslice cipher. The mixing produces one word containing the first bit of each block, one word containing the second bit of each block, and so on. It is the transposition of a k×k matrix.
For 64-bit CPUs, the mixing+demixing is four 64×64 transpositions. It takes 6500 cycles on an EV6 (100 cycles per block), 10300 cycles on an EV56 (160 cycles per block) and 11000 cycles on a 64-bit UltraSparc (170 cycles per block), using Thomas Pornin's tricks. Robert Harley made an implementation of the 64×64 transpose that is a little faster: mixing+demixing takes 6000 cycles on EV6 and 10000 cycles on EV56.
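As an illustration of what mixing and de-mixing compute, here is a deliberately naive sketch. It is not Thomas Pornin's or Robert Harley's fast transpose, whose tricks are not described on this page. The function names `mix` and `demix` are hypothetical, and taking the "first bit" of a block to be its least significant bit is an assumption made only for this example.

```python
def mix(blocks, block_bits=128):
    """Bit-transpose k blocks into `block_bits` words of k bits each.

    Bit j of words[i] is bit i of blocks[j] (bit 0 = least significant,
    an assumption of this sketch).
    """
    words = [0] * block_bits
    for j, block in enumerate(blocks):
        for i in range(block_bits):
            words[i] |= ((block >> i) & 1) << j
    return words


def demix(words, k=64):
    """Inverse of mix: rebuild the k blocks from the bit-sliced words."""
    blocks = [0] * k
    for i, word in enumerate(words):
        for j in range(k):
            blocks[j] |= ((word >> j) & 1) << i
    return blocks


# 64 blocks of 128 bits, as on a 64-bit CPU
blocks = [(j * 0x0123456789ABCDEF0123456789ABCDEF + j) % (1 << 128)
          for j in range(64)]
words = mix(blocks)            # 128 words of 64 bits
assert demix(words, 64) == blocks
```

This bit-by-bit version costs on the order of k×128 operations per direction. The cycle counts quoted above come from much faster word-level transposes.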
Today, there is no bitslice implementation of an AES candidate.


Thanks

I'd like to thank Jacques Beigbeder and the SPI, Mark Shand and Xavier Bertou for allowing these tests on their computers. Many thanks to Thomas Pornin, Fabrice Noilhan and Robert Harley for fruitful discussions...


Update information

2000/04/20
Updates from AES3 conference presentations.
2000/04/02
Eric Young's timings of RC6 on Genuine Itanium.
2000/02/04
Crypto/UCL Caesar project has done some implementations on smart cards.
1999/04/08
Robert Harley (INRIA) and David Seal (ARM Ltd) StrongARM assembly code for DFC.
UltraSparc code for DFC, using floating point multiplication (Robert Harley).
PA8200 timings, Merced/McKinley estimations (Doug Whiting).
1999/04/01
Faster Pentium II code for E2, as reported in AES2.
Pentium II asm for RC6 by Ted Krovetz runs in 243 cycles instead of 232.
1999/03/08
Optimisation of UltraSparc C code for DFC.
1999/02/28
Pentium assembly code for DFC (Robert Harley).
1999/02/24
Timing for DFC on StrongARM.
Pentium Pro code for DFC is slightly faster (Robert Harley).
1999/02/18
Two separate rows for timings on true Pentium and on 6x86MX, because these are quite different cpus.
Faster AXP code for DFC, assembly by Robert Harley.
1999/02/08
Faster Pentium Pro code for DFC, by Dominik Behr, Robert Harley, Danjel McGougan and Terje Mathisen.
Faster C implementation of Twofish on UltraSparc by Helger Lipmaa.
Faster C implementation of Safer + on Pentium Pro by Cylink team.
1999/01/27
Faster AXP code for DFC, by Robert Harley.
Faster Pentium II code for DFC, by Dominik Behr.
Faster C implementation of Rijndael on UltraSparc by Helger Lipmaa.
1999/01/22
New implementation of DEAL by Brian Gladman, faster on Pentium Pro only.
Timings on Supersparc (superscalar Sparc v8 by TI and Sun, found in SS10/20)
Timings on Alpha EV45.
Timings on HP PA7000.
1999/01/21
Faster asm implementation of Rijndael on Pentium II by Helger Lipmaa.
1999/01/19
Erroneous timings for Twofish on UltraSparc.
Faster implementation of Rijndael by Brian Gladman.
Timings on Turbosparc (non superscalar Sparc v8 by Fujitsu, found in SS5).
1999/01/18
New timings on Alpha EV56, done with Dec cc.
Timings on EV6.
1999/01/14
For Safer +, we have new timings on alpha, using the EV56 byte instructions.
Updating erroneous timings for Intel 486.
Updating erroneous timings for Rijndael.
Faster 64 bits C code for DFC, by Fabrice Noilhan.
New Twofish implementation by Brian Gladman, faster on Pentium Pro and UltraSparc, but not on Alpha nor Cyrix.
1999/01/13
New timings, Crypton version 1.0
Timings on UltraSparc, compilation with SW5.0 cc.
1999/01/11
Faster AXP code for DFC, by Robert Harley.
Adding timings for Brian Gladman's new code on Alpha, UltraSparc and other machines.
1999/01/08
Adding an empty "Best bitslice timings" table, hoping someone will make a bitslice implementation of some AES candidates...
Adding new Brian Gladman's timings for MVC on Pentium Pro.
1999/01/06
Kenneth Almquist's estimates of speeds on AXP 21164.
New timing for Twofish on Pentium Pro.
1998/12/22
HPC best timing on Alpha : author's timings.