JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

A visual analysis approach for data imputation via multi-party tabular data correlation strategies

Author(s): Haiyang ZHU, Dongming HAN, Jiacheng PAN, Yating WEI, Yingchaojie FENG, Luoxuan WENG, Ketian MAO, Yuankai XING, Jianshu LV, Qiucheng WAN, Wei CHEN
Affiliation(s): The State Key Lab of CAD & CG, Zhejiang University, Hangzhou 310058, China
Corresponding email(s): hnsyzhy@zju.edu.cn, chenvis@zju.edu.cn
Key Words: Data governance; Data incompleteness; Data imputation; Data visualization; Interactive visual analysis

Haiyang ZHU, Dongming HAN, Jiacheng PAN, Yating WEI, Yingchaojie FENG, Luoxuan WENG, Ketian MAO, Yuankai XING, Jianshu LV, Qiucheng WAN, Wei CHEN. A visual analysis approach for data imputation via multi-party tabular data correlation strategies[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300480

@article{title="A visual analysis approach for data imputation via multi-party tabular data correlation strategies",
author="Haiyang ZHU, Dongming HAN, Jiacheng PAN, Yating WEI, Yingchaojie FENG, Luoxuan WENG, Ketian MAO, Yuankai XING, Jianshu LV, Qiucheng WAN, Wei CHEN",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2300480"
}

%0 Journal Article
%T A visual analysis approach for data imputation via multi-party tabular data correlation strategies
%A Haiyang ZHU
%A Dongming HAN
%A Jiacheng PAN
%A Yating WEI
%A Yingchaojie FENG
%A Luoxuan WENG
%A Ketian MAO
%A Yuankai XING
%A Jianshu LV
%A Qiucheng WAN
%A Wei CHEN
%J Frontiers of Information Technology & Electronic Engineering
%P 398-414
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2300480

TY - JOUR
T1 - A visual analysis approach for data imputation via multi-party tabular data correlation strategies
A1 - Haiyang ZHU
A1 - Dongming HAN
A1 - Jiacheng PAN
A1 - Yating WEI
A1 - Yingchaojie FENG
A1 - Luoxuan WENG
A1 - Ketian MAO
A1 - Yuankai XING
A1 - Jianshu LV
A1 - Qiucheng WAN
A1 - Wei CHEN
JO - Frontiers of Information Technology & Electronic Engineering
SP - 398
EP - 414
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2300480
ER -

Abstract: Data imputation is an essential pre-processing task for data governance, aimed at filling in incomplete data. However, conventional data imputation methods draw only on isolated tabular data, so they can only partly alleviate data incompleteness, and they fail to achieve the best balance between accuracy and efficiency. In this paper, we present a novel visual analysis approach for data imputation. We develop a multi-party tabular data association strategy that uses intelligent algorithms to identify similar columns and establish column correlations across multiple tables. Then, we perform the initial imputation of incomplete data using correlated data entries from other tables. Additionally, we develop a visual analysis system to refine data imputation candidates. Our interactive system combines the multi-party data imputation approach with expert knowledge, allowing for a better understanding of the relational structure of the data. This significantly enhances the accuracy and efficiency of data imputation, which in turn improves the quality of data governance and the intrinsic value of data assets. Experimental validation and user surveys demonstrate that this method supports users in verifying and judging the associated columns and similar rows using their domain knowledge.
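
The pipeline described in the abstract (find similar columns across tables, link them, then fill gaps from the linked table) can be illustrated with a minimal sketch. The abstract does not name the "intelligent algorithms" used, so everything below is an assumption: Jaccard overlap of column values stands in for the similarity measure, and the table contents, column names, threshold, and the helper names `jaccard`, `match_columns`, and `impute` are all hypothetical.

```python
from typing import Dict, List, Optional

# A table is a mapping from column name to column values; None marks a missing cell.
Table = Dict[str, List[Optional[str]]]


def jaccard(a: List[Optional[str]], b: List[Optional[str]]) -> float:
    """Jaccard similarity between the sets of non-missing values of two columns."""
    sa = {v for v in a if v is not None}
    sb = {v for v in b if v is not None}
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def match_columns(left: Table, right: Table, threshold: float = 0.5) -> Dict[str, str]:
    """Map each left column to its most similar right column, if it reaches the threshold."""
    matches: Dict[str, str] = {}
    for lname, lcol in left.items():
        best_name, best_score = None, threshold
        for rname, rcol in right.items():
            score = jaccard(lcol, rcol)
            if score >= best_score:
                best_name, best_score = rname, score
        if best_name is not None:
            matches[lname] = best_name
    return matches


def impute(left: Table, right: Table, key: str, target: str,
           matches: Dict[str, str]) -> List[Optional[str]]:
    """Fill missing left[target] cells from the right-table row with the same key value."""
    rkey, rtarget = matches[key], matches[target]
    lookup = {k: v for k, v in zip(right[rkey], right[rtarget])
              if k is not None and v is not None}
    return [v if v is not None else lookup.get(k)
            for k, v in zip(left[key], left[target])]


# Hypothetical tables held by two parties (illustrative data only).
orders: Table = {
    "customer_id": ["c1", "c2", "c3", "c4"],
    "city": ["Hangzhou", None, "Ningbo", None],
}
crm: Table = {
    "cust": ["c1", "c2", "c3", "c4"],
    "location": ["Hangzhou", "Shanghai", "Ningbo", None],
}

matches = match_columns(orders, crm)
filled_city = impute(orders, crm, key="customer_id", target="city", matches=matches)
```

In this sketch the cell for `c4` stays missing because neither table holds a value for it. Candidates like this, together with the proposed column links themselves, are what an expert would review in an interactive refinement step of the kind the abstract describes.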

Haiyang ZHU^1,2, Dongming HAN^1, Jiacheng PAN^1, Yating WEI^3, Yingchaojie FENG^1, Luoxuan WENG^1, Ketian MAO^1, Yuankai XING^2, Jianshu LV^2, Qiucheng WAN^2, Wei CHEN^1
^1 The State Key Lab of CAD & CG, Zhejiang University, Hangzhou 310058, China
^2 Wuchan Zhongda Digital Technology Co., Ltd., Hangzhou 310020, China
^3 Wuchan Zhongda Metal Group Co., Ltd., Hangzhou 310005, China
