Showing articles tagged with data analysis

Published in CS Science
Handy data cleaning tool - CSV fingerprints

Recently I stumbled upon a handy little tool that may be interesting for everyone working with data in tables. An important but often tedious task is the cleaning of your dataset before you can actually start running statistical analyses. During this cleaning or mastering process you may find artifacts like the following:

  • Entries with unexpected data types: for example, test takers were expected to describe something in prose, but a few entered a number instead.
  • Empty cells where no missing values are allowed: perhaps the result of a mistake when entering paper-and-pencil data manually.
  • A sudden shift of cell values to the right, causing many values to fall into the wrong column: this happens when the delimiter character also appears in the data itself.

If you've ever worked with larger data sets, you surely know these or similar problems and how hard they can be to spot.

CSV Fingerprints gives you a very quick first visual impression of your data and can therefore save you a lot of time. Victor Powell, the author of this handy tool, explains CSV Fingerprints in more detail on his blog. There is also a full-screen version of the tool available.
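To make the idea concrete, here is a minimal text-only sketch of a "fingerprint": every cell is reduced to its type, and rows with an unexpected number of fields are flagged. This is not Victor Powell's implementation; the function names, the symbols, and the sample data are assumptions made up for this illustration.

```python
import csv
import io

# One-character symbols for the three cell types (an arbitrary choice for this sketch).
SYMBOLS = {"empty": "_", "number": "N", "text": "T"}


def classify(cell):
    """Return 'empty', 'number', or 'text' for a single CSV cell."""
    text = cell.strip()
    if text == "":
        return "empty"
    try:
        float(text)  # note: float() also accepts strings such as "nan" or "1e3"
        return "number"
    except ValueError:
        return "text"


def fingerprint(csv_text, delimiter=","):
    """Return (line number, type pattern, wrong_field_count) per data row."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    report = []
    for line_no, row in enumerate(body, start=2):  # line 1 is the header
        pattern = "".join(SYMBOLS[classify(cell)] for cell in row)
        report.append((line_no, pattern, len(row) != len(header)))
    return report


# Hypothetical survey export: an empty cell, a number where prose was
# expected, and an unquoted comma that shifts values to the right.
SAMPLE = (
    "id,age,comment\n"
    "1,34,likes the product\n"
    "2,,too expensive\n"
    "3,29,42\n"
    "4,51,good, but slow\n"
)

if __name__ == "__main__":
    for line_no, pattern, shifted in fingerprint(SAMPLE):
        note = "  <- unexpected number of fields" if shifted else ""
        print(f"line {line_no}: {pattern}{note}")
```

Reading the patterns column by column makes the odd rows stand out, which is the same effect the visual tool achieves with colors.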

Tip: Don't try to copy and paste data directly from Excel; always copy the CSV from a text editor.

In the wake of our recent posts about longitudinal studies, we'd like to recommend a recently published book by John J. McArdle and John R. Nesselroade.


Longitudinal studies are on the rise, no doubt. Properly conducting longitudinal studies and then analyzing the data can be a complex undertaking. John McArdle and John Nesselroade focus on five basic questions that can be tackled with structural equation models when analyzing longitudinal data:

  • Direct identification of intraindividual changes.
  • Direct identification of interindividual differences in intraindividual changes.
  • Examining interrelationships in intraindividual changes.
  • Analyses of causes (determinants) of intraindividual changes.
  • Analyses of causes (determinants) of interindividual differences in intraindividual changes.

I find it especially noteworthy that the authors emphasize factorial invariance over time and latent change scores. In my view, this makes the book a must-read for anyone who wants to become a longitudinal data wizard.

Need another argument? Afraid of cumbersome mathematical language? Here is what the authors say about it: “We focus on the big picture approach rather than the algebraic details.”

Longitudinal research has increased considerably in the past 20 years thanks to the development of new theories and methodologies. Nevertheless, studies in the social sciences are still dominated by cross-sectional research designs or deficient longitudinal research, because many researchers lack guidelines for conducting adequate longitudinal research and for interpreting duration and change in constructs and variables.

To create a more systematic approach to longitudinal research, Ployhart and Ward (2011) have created a quick start guide on how to conduct high quality longitudinal research.

The following information covers three stages: the theoretical development of the study design, the analysis of longitudinal results, and relevant tips for publishing the research. The most relevant information provided by the authors is summarized below as a checklist that can help you improve your research ideas and design:

Why is longitudinal research important?

Longitudinal research not only helps to investigate the relationship between two variables over time, it also makes it possible to disentangle the direction of effects. In addition, it helps to investigate how a variable changes over time and how long this change lasts. For instance, one might investigate how the job satisfaction of new hires changes over time and whether certain features of the job (e.g., feedback from the supervisor) predict the form of change. Such questions can only be answered through longitudinal investigation with repeated measurements of the construct. To study change, a well-conducted longitudinal study needs at least three waves of data (Ployhart & Vandenberg, 2010).

What sample size is needed to conduct longitudinal research?

Since the estimation of power is a complex issue in longitudinal research, the authors give a rather general answer to this question: the sample should be “as large as you can get!” However, they also offer a useful rule of thumb. Statistical power depends, among other things, on the number of subjects and on the number of repeated measures: “If one must choose between adding subjects versus measurement occasions, our recommendation is to first identify the minimum number of repeated measurements required to adequately test the hypothesized form of change and then maximize the number of subjects.”

When to administer measures?

When studying change over time, the timing of measurement is crucial (Mitchell & James, 2001). The measurement spacing should adequately capture the expected form of change, so it will differ for linear change as compared to non-linear (e.g., exponential or logarithmic) change. Such thinking still runs contrary to common practice: most study designs use evenly spaced measurement occasions and pay little attention to the type of change under study. However, it is important that measurement waves occur frequently enough and cover the theoretically important periods of the change. This requires careful theoretical reasoning beforehand. Otherwise, the statistical models will over- or underestimate the true nature of the changes under study.
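A small numerical sketch can show why spacing matters. It assumes, purely for illustration, that the construct approaches its final level exponentially with a time constant of two weeks over a twelve-week study. The curve, the constant, and both schedules are hypothetical and not taken from the cited papers.

```python
import math


def expected_score(t, tau):
    """Hypothetical change curve: exponential approach to an asymptote of 1."""
    return 1.0 - math.exp(-t / tau)


def change_between_waves(times, tau):
    """Amount of change (as a share of the asymptote) between consecutive waves."""
    scores = [expected_score(t, tau) for t in times]
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


TAU = 2.0                        # assumed time constant in weeks
EVEN = [0, 3, 6, 9, 12]          # evenly spaced waves (weeks)
FRONT_LOADED = [0, 1, 2, 4, 12]  # more waves early, where change is fastest

if __name__ == "__main__":
    for label, times in (("even", EVEN), ("front-loaded", FRONT_LOADED)):
        shares = [round(s, 2) for s in change_between_waves(times, TAU)]
        print(label, shares)
```

Under these assumptions, roughly three quarters of the change happens before the second evenly spaced wave, so the shape of the curve is barely observed. The front-loaded schedule uses the same number of waves but spreads the change more evenly across intervals.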

Be it a longitudinal study or a diary study, the cloud solutions software can handle any type of timing and frequency between measurement occasions. The flexibility of our online solutions stems from an “event flow engine” that is based on neural networks.

What to do about missing data?

The statistical analysis of longitudinal research can become complex. One particular challenge is the treatment of missing data, which are very common because longitudinal studies often suffer from high dropout rates. Some recommendations help to reduce missing data before and during data collection. When conducting surveys in organizations, one way to increase the response rate is to make sure that the company allows its workers to complete the survey during working hours. A specific technique for reducing the burden on individual participants while still measuring frequently over a longer period is planned missingness.

When it comes to handling missing data in statistical analyses, the most important question is whether the data are missing at random or not. If the data are missing at random, there is not much to worry about: full information maximum likelihood estimation uses all available observations and provides unbiased parameter estimates without having to fill in the missing data points. If the data are not missing at random, more sophisticated analytical techniques may be required. Ployhart and Ward (2011) recommend Little and Rubin (2002) for further reading on this issue.
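The following simulation is a rough sketch of why the missingness mechanism matters. It assumes two waves, dropout at wave 2 that depends only on the observed wave-1 score (missing at random), and made-up effect sizes and dropout probabilities. The regression adjustment is a simple stand-in that uses everyone's wave-1 information; it is not full information maximum likelihood itself.

```python
import random
from statistics import mean


def simulate(n=20000, seed=1):
    """Simulate two waves; people with low wave-1 scores drop out more often."""
    rng = random.Random(seed)
    wave1, wave2, observed = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)                # wave-1 score, always observed
        y = 0.8 * x + rng.gauss(0, 0.6)    # wave-2 score (assumed relationship)
        p_drop = 0.8 if x < 0 else 0.2     # assumed dropout probabilities
        wave1.append(x)
        wave2.append(y)
        observed.append(rng.random() >= p_drop)
    return wave1, wave2, observed


def complete_case_mean(wave2, observed):
    """Naive wave-2 mean that simply ignores everyone who dropped out."""
    return mean(y for y, seen in zip(wave2, observed) if seen)


def regression_adjusted_mean(wave1, wave2, observed):
    """Wave-2 mean adjusted with the wave-1 scores of the full sample."""
    xs = [x for x, seen in zip(wave1, observed) if seen]
    ys = [y for y, seen in zip(wave2, observed) if seen]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    # Shift the complete-case mean by the gap in wave-1 means.
    return my + slope * (mean(wave1) - mx)


if __name__ == "__main__":
    w1, w2, seen = simulate()
    print("mean without any dropout:", round(mean(w2), 3))
    print("complete cases only:     ", round(complete_case_mean(w2, seen), 3))
    print("regression adjusted:     ", round(regression_adjusted_mean(w1, w2, seen), 3))
```

In this setup the complete-case mean is clearly too high, while the adjusted estimate stays close to the mean that would have been observed without dropout. If dropout depended on the unobserved wave-2 score itself, this adjustment would no longer be enough.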

Which analytical method to use?

Simply put, there are three statistical frameworks that can be used to model longitudinal data.

  • Repeated measures general linear model: Useful when the focus of interest lies on mean changes within persons over time and missing data are unproblematic.
  • Random coefficient modeling: Useful when one is interested in between-person differences in change over time. Especially useful when the growth models are simple and the predictors of change are static.
  • Structural equation modeling: Useful when one is interested in between-person differences in change over time. Especially useful with more complex growth models, including time-varying predictors, dynamic relationships, or mediated change.
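As a rough illustration of what "between-person differences in change" means, a linear growth model with one static predictor can be written in two levels. The notation is not taken from Ployhart and Ward (2011); it follows common multilevel conventions, and the symbols are assumptions for this sketch.

```latex
% Level 1: change within person i across measurement occasions t
y_{ti} = \pi_{0i} + \pi_{1i}\,\mathrm{time}_{ti} + e_{ti}

% Level 2: differences between persons in starting level and rate of change
\pi_{0i} = \gamma_{00} + \gamma_{01}\,x_i + u_{0i}
\pi_{1i} = \gamma_{10} + \gamma_{11}\,x_i + u_{1i}
```

Here $\gamma_{10}$ describes the average rate of change, the variance of $u_{1i}$ describes how much persons differ in their rate of change, and $\gamma_{11}$ indicates whether the static predictor $x_i$ explains those differences.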

The following table from Ployhart and Ward (2011) gives a more detailed insight into the application of the three methods:

Use the following method... when these conditions are present:

Repeated measures general linear model
  • Focus on group mean change
  • Identify categorical predictors of change (e.g., training vs. control group)
  • Assumptions with residuals are reasonably met
  • Two waves of repeated data
  • Variables are highly reliable
  • Little to no missing data

Random coefficient modeling
  • Focus on individual differences in change over time
  • Identify continuous or categorical predictors of change
  • Residuals are correlated, heterogeneous, etc.
  • Three or more waves of data
  • Variables are highly reliable
  • Model simple mediated or dynamic models
  • Missing data are random

Structural equation modeling
  • Focus on individual differences in change over time
  • Identify continuous or categorical predictors of change
  • Residuals are correlated, heterogeneous, etc.
  • Three or more waves of data
  • Want to remove unreliability
  • Model complex mediated or dynamic models


How to make a relevant theoretical contribution worth publishing?

When publishing longitudinal research, you should always describe why your longitudinal design is better at explaining the constructs and their relationships than an equivalent cross-sectional design. Then you should underline the advantages of your study design compared to previous ones. Go through the following questions when justifying why your research is worth publishing:

  • Have you developed hypotheses from a cross-sectional or from a longitudinal theory?
  • Have you explained why change occurs in your constructs?
  • Have you described why you measured the variables at various times and how this constitutes a sufficient sampling rate?
  • Have you considered threats to internal validity?
  • Have you explained how you reduced missing data?
  • Have you explained why you chose this analytical method?

cloud solutions wishes you success with your longitudinal research!
