CLC number: TN47
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2008-11-10
Jian-ying PENG, Xiao-lang YAN, De-xian LI, Li-zhong CHEN. A parallel memory architecture for video coding[J]. Journal of Zhejiang University Science A, 2008, 9(12): 1644-1655.
@article{title="A parallel memory architecture for video coding",
author="Jian-ying PENG, Xiao-lang YAN, De-xian LI, Li-zhong CHEN",
journal="Journal of Zhejiang University Science A",
volume="9",
number="12",
pages="1644-1655",
year="2008",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.A0820052"
}
%0 Journal Article
%T A parallel memory architecture for video coding
%A Jian-ying PENG
%A Xiao-lang YAN
%A De-xian LI
%A Li-zhong CHEN
%J Journal of Zhejiang University SCIENCE A
%V 9
%N 12
%P 1644-1655
%@ 1673-565X
%D 2008
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.A0820052
TY - JOUR
T1 - A parallel memory architecture for video coding
A1 - Jian-ying PENG
A1 - Xiao-lang YAN
A1 - De-xian LI
A1 - Li-zhong CHEN
JO - Journal of Zhejiang University Science A
VL - 9
IS - 12
SP - 1644
EP - 1655
SN - 1673-565X
Y1 - 2008
PB - Zhejiang University Press & Springer
DO - 10.1631/jzus.A0820052
ER -
Abstract: To exploit the performance of single instruction multiple data (SIMD) architectures for video coding efficiently, a parallel memory architecture with a power-of-two number of memory modules is proposed. It employs two novel skewing schemes to provide conflict-free access, in both the horizontal and vertical directions, to adjacent elements (8-bit and 16-bit data types) or to elements at power-of-two intervals, which was not possible in previous parallel memory architectures. Area and delay estimates are given for 4, 8, and 16 memory modules. Synthesis results in a 0.18-μm CMOS technology show that the proposed system can reach a 230 MHz clock frequency with 16 memory modules at a cost of 19k gates when the read and write latencies are 3 and 2 clock cycles, respectively. The proposed parallel memory architecture is implemented on a video signal processor (VSP); the results show that the VSP enhanced with the proposed architecture achieves a 1.28× speedup for H.264 real-time decoding.
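The abstract does not spell out the two skewing schemes, so the following is only a minimal sketch of the general idea, using a textbook linear skew rather than the authors' method. The function names, the mapping (row + col) mod N, and the sample coordinates are all assumptions chosen for illustration. An access is conflict-free when every requested element sits in a different memory module, so that all elements can be fetched in the same cycle.

```python
# Illustrative only: a classic linear skew, NOT the schemes proposed in the paper.

def module_index(row, col, n_modules):
    """Hypothetical mapping: element (row, col) is stored in module (row + col) mod N."""
    return (row + col) % n_modules


def is_conflict_free(coords, n_modules):
    """True if every (row, col) in coords maps to a distinct memory module."""
    modules = [module_index(r, c, n_modules) for r, c in coords]
    return len(set(modules)) == len(modules)


N = 16  # number of memory modules (a power of two, as in the abstract)

# N horizontally adjacent elements, starting at an arbitrary position.
row_access = [(5, 3 + k) for k in range(N)]
# N vertically adjacent elements, starting at the same position.
col_access = [(5 + k, 3) for k in range(N)]
# N elements along a row at a power-of-two interval of 2.
stride2_access = [(5, 3 + 2 * k) for k in range(N)]

print(is_conflict_free(row_access, N))      # adjacent, horizontal
print(is_conflict_free(col_access, N))      # adjacent, vertical
print(is_conflict_free(stride2_access, N))  # interval of 2, horizontal
```

Under this assumed mapping, the adjacent row and column accesses each touch all 16 modules exactly once, while the interval-2 access lands on only 8 modules and therefore conflicts. That gap is the kind of limitation a more elaborate skewing scheme has to address.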