# Identification of outliers and influential points in principal component analysis and the impact of removing them on the models

The subject of outliers has been studied for a long time. An outlier is defined as an observation that falls far from the main body of the data. Some observations may not be identified as outliers, yet still play an excessive role in estimating the model parameters; these points are called influential points. Identification of outliers and influential points in Principal Component Analysis (PCA) depends on a correlation-coefficient criterion.

The purpose of this study is to develop a method to identify outliers and influential points in PCA and to determine the impact of removing either or both on the number of components and the values of the eigenvalues in the model.

Methods were developed based on Comrey's (1985) work on detecting outliers in factor analysis. An observation is defined as an outlier if the difference between its average standardized-score cross-product and that based on all the data is large at a specified $\alpha$ level. An influential point is defined as a point for which there is a large difference between the average score cross-product computed for all the data and that computed with this point excluded.

This method was applied to simulated data sets drawn from the multivariate normal distribution based on high, low, and general correlation matrices. These data sets varied in sample size (100-1000 cases) and number of variables (7-12). Some data sets included two binary variables; others were contaminated with observations drawn from skewed distributions. The method was also applied to two existing data sets, the Talent data and the Pulmonary Function data.

The results of this study showed that when the correlation matrix was low, the number of variables was fewer than 10, the data were contaminated, or the sample size was small, removing outliers or influential points was most likely to change both the number of components and the values of the eigenvalues.
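The criterion described above can be sketched in code. This is a minimal illustration, not the study's actual implementation: the function names are invented, the per-case statistic is taken as the mean of pairwise standardized-score cross-products, and since the abstract does not state the reference distribution for the $\alpha$-level cutoff, an empirical two-sided quantile is used as a stand-in. The leave-one-out influence score also reuses the full-data standardization rather than restandardizing each time, a simplification.

```python
import numpy as np

def avg_score_cross_products(X):
    """Per-case average of pairwise standardized-score cross-products."""
    # standardize each variable (column)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    n, p = Z.shape
    iu = np.triu_indices(p, k=1)  # distinct variable pairs
    # for each case i, average z_ij * z_ik over all pairs (j, k)
    return np.array([np.outer(Z[i], Z[i])[iu].mean() for i in range(n)])

def flag_outliers(X, alpha=0.05):
    """Flag cases whose statistic differs most from the overall average.

    The empirical-quantile cutoff is an assumption standing in for the
    unspecified alpha-level criterion in the abstract.
    """
    cp = avg_score_cross_products(X)
    diff = cp - cp.mean()
    lo, hi = np.quantile(diff, [alpha / 2, 1 - alpha / 2])
    return np.where((diff < lo) | (diff > hi))[0]

def influence_scores(X):
    """Leave-one-out change in the overall average cross-product."""
    cp = avg_score_cross_products(X)
    n = len(cp)
    overall = cp.mean()
    loo = (overall * n - cp) / (n - 1)  # mean with case i removed
    return np.abs(overall - loo)
```

A planted extreme case produces large standardized scores on every variable, hence large cross-products, so it stands out on both the outlier and the influence statistic.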
Generally, removing outliers and/or influential points decreased the values of the eigenvalues; however, when the number of components increased after removing outliers and/or influential points, the eigenvalues increased. Application of this method to the Talent data showed that deleting outliers and/or influential points made the components easier to interpret, and the importance of some variables changed. Applying the method to the Pulmonary Function data changed the number of latent components in the data from three to four; some variables were important only when outliers or influential observations were deleted.

In conclusion, the study suggests the importance of identifying outliers and influential points before conducting an analysis using PCA. It also points to the need for future studies aimed at identifying the distribution of the influential points.