1. Introduction to Descriptive Statistics
Descriptive statistics are the foundational tools for summarizing, organizing, and visualizing data in both clinical and epidemiological studies. Before moving on to more complex analytical methods (like regression or hypothesis testing), it’s crucial to:
Identify the type of data (numerical vs. categorical).
Use appropriate descriptive measures (mean, median, proportions).
Visually inspect data distributions (histograms, boxplots, bar charts).
These steps allow researchers to understand the “shape” of the data, spot potential errors or outliers, and communicate findings clearly.
2. Types of Data and Variables
A. Numerical Data
Continuous: Variables that can take on any value within a range or interval, theoretically without breaks.
Examples: Serum creatinine level, body weight, blood pressure, blood glucose levels.
Discrete: Variables that can only take specific (often integer) values.
Examples: Number of hospital admissions, number of adverse events, number of children in a family.
B. Categorical Data
Nominal: Categories without any intrinsic order.
Examples: Blood group (A, B, AB, O), type of health insurance, eye color.
Ordinal: Categories that have a meaningful order, but intervals between categories may not be equal.
Examples: Staging of cancer (Stage I, II, III, IV), Likert scale responses (Strongly disagree, Disagree, Neutral, Agree, Strongly agree), NYHA functional classes (I, II, III, IV).
Key Takeaway: Correctly classifying variables ensures that the right descriptive statistics and visualization methods are applied.
3. Descriptive Statistics for Numerical Data
A. Distribution-Based Checks
Before computing summary statistics, it is helpful to assess the distribution of your numerical data:
Normality: A “bell-shaped” or Gaussian distribution is common in many biological measurements, but not guaranteed.
Skewness: Indicates asymmetry in the distribution. A right-skew (tail to the right) often arises in variables like hospital length of stay or annual income.
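Kurtosis: Describes the “tailedness” of the distribution, that is, how much of its variability comes from extreme values in the tails. It is often described as peakedness, but it is mainly a measure of tail weight.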
Diagnostic Tools:
Histogram: Quick visual to see if data cluster around the mean or exhibit skew.
Boxplot: Reveals median, quartiles, and potential outliers.
Q-Q Plot: Checks normality by comparing quantiles of the sample data to a normal distribution.
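The skewness and kurtosis checks above can also be computed numerically. The following is a minimal sketch using only the Python standard library. The helper names (moment_skewness, excess_kurtosis) and the sample values are made up for illustration, and the formulas are the simple moment-based (population) versions; statistical packages often apply small-sample corrections, so their results may differ slightly.

```python
import statistics

def moment_skewness(values):
    """Average cubed z-score; positive values indicate a right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; must be non-zero
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)

def excess_kurtosis(values):
    """Average fourth-power z-score minus 3 (a normal distribution gives 0)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; must be non-zero
    return sum(((x - mean) / sd) ** 4 for x in values) / len(values) - 3.0

# Hypothetical hospital lengths of stay in days (one very long admission)
length_of_stay = [1, 2, 2, 3, 3, 3, 4, 4, 5, 21]
# Hypothetical perfectly symmetric sample for comparison
symmetric = [1, 2, 3, 4, 5, 6, 7]

print("Skewness, length of stay:", moment_skewness(length_of_stay))
print("Skewness, symmetric sample:", moment_skewness(symmetric))
print("Excess kurtosis, symmetric sample:", excess_kurtosis(symmetric))
```

A clearly positive skewness for the length-of-stay sample matches the right tail a histogram would show, while the symmetric sample gives a skewness of essentially zero.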
B. Measures of Central Tendency
Mean (Arithmetic Average)
Definition: Sum of all observations divided by the number of observations.
Use Case: Data that appear to be roughly symmetric with few extreme outliers.
Interpretation: Represents the “balance point” of the distribution.
Median
Definition: The middle value when observations are sorted; with an even number of observations, the average of the two middle values.
Use Case: Skewed or heavily outlier-prone data (e.g., incomes, hospital lengths of stay).
Interpretation: Half of the observations lie below and half above the median.
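As a small illustration of why the median is preferred for skewed data, the sketch below compares the two measures on a made-up sample of hospital lengths of stay. The values are hypothetical and chosen so that one long admission pulls the mean upward.

```python
import statistics

# Hypothetical lengths of stay in days; the last patient had a long admission
length_of_stay = [2, 3, 3, 4, 5, 6, 40]

mean_los = statistics.fmean(length_of_stay)     # 63 / 7 = 9.0
median_los = statistics.median(length_of_stay)  # middle of 7 sorted values = 4

# With an even number of observations, the median is the average
# of the two middle values.
even_sample = [2, 3, 4, 5]
median_even = statistics.median(even_sample)    # (3 + 4) / 2 = 3.5

print("Mean:", mean_los)
print("Median:", median_los)
print("Median of even-sized sample:", median_even)
```

Here the mean (9 days) is longer than the stay of six of the seven patients, while the median (4 days) still describes a typical patient.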
C. Measures of Dispersion
Standard Deviation (SD)
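Definition: The square root of the variance, where the variance is the average squared deviation of the observations from the mean (with n − 1 in the denominator for a sample). Informally, it can be read as the typical distance of an observation from the mean.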
Use Case: Paired with the mean in symmetrically distributed data.
Interpretation: Indicates clustering of data around the mean; a small SD means points are tight around the mean.
Interquartile Range (IQR)
Definition: The difference between the 75th percentile (Q3) and 25th percentile (Q1).
Use Case: When the median is used (due to skew or outliers).
Interpretation: Q1 and Q3 bound the central 50% of data points, and the IQR is the width of that interval. Because it ignores the tails, it is a more robust measure of spread for non-normal distributions.
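The difference in robustness between the SD and the IQR can be sketched with the Python standard library. The data values are hypothetical, and the quartiles come from statistics.quantiles with its default method; other software may use a different quartile convention and return slightly different values.

```python
import statistics

def iqr(values):
    """Interquartile range (Q3 - Q1) using the default quantile method."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical measurements, then the same data with one extreme value
values = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = [2, 4, 4, 4, 5, 5, 7, 90]

sd = statistics.stdev(values)                 # sample SD (n - 1 denominator)
sd_outlier = statistics.stdev(with_outlier)
iqr_plain = iqr(values)
iqr_outlier = iqr(with_outlier)

print("SD without / with outlier:", sd, sd_outlier)
print("IQR without / with outlier:", iqr_plain, iqr_outlier)
```

In this example the single extreme value inflates the SD many times over, while the IQR does not change because the quartiles depend only on the middle of the sorted data.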
4. Descriptive Statistics for Categorical Data
When dealing with nominal or ordinal variables, the following are standard ways to summarize:
Frequency Counts
Definition: The number of observations falling into each category.
Example: If your dataset has 100 patients and 30 of them have type 2 diabetes, the count is 30 for that category.
Percentages or Proportions
Definition: The fraction (or percentage) of observations in each category.
Example: In the example above, 30 out of 100 patients with type 2 diabetes translates to 30%.
Additional Tip: For ordinal data, you can also present cumulative frequencies (e.g., proportion at or below a certain stage of disease).
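Counts, percentages, and cumulative percentages for an ordinal variable can be tabulated in a few lines. This is a minimal sketch with a made-up sample of ten cancer stages; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical cancer stages for 10 patients
stages = ["I", "II", "II", "III", "I", "IV", "II", "III", "II", "I"]
# Ordinal data: the category order must be stated explicitly
stage_order = ["I", "II", "III", "IV"]

counts = Counter(stages)
n = len(stages)

summary = []  # rows of (stage, count, percent, cumulative percent)
cumulative = 0
for stage in stage_order:
    count = counts[stage]
    cumulative += count
    summary.append((stage, count, 100 * count / n, 100 * cumulative / n))

for stage, count, pct, cum_pct in summary:
    print(f"Stage {stage}: n={count}, {pct:.0f}%, cumulative {cum_pct:.0f}%")
```

The cumulative column only makes sense because the stages are ordered; for a nominal variable such as blood group, the count and percentage columns alone would be reported.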
5. Visualization Techniques
Visualization is a powerful aid in understanding and communicating data distributions. Appropriate charts and plots depend on the type of variable:
Histograms (Numerical Data)
Show the frequency (or density) of observations binned across intervals of continuous or discrete data.
Great for spotting skew, multi-modal distributions (multiple peaks), or outliers.
Boxplots (Numerical Data)
Display the median, the quartiles (the box spans the IQR), and points flagged as potential outliers.
Very useful for comparing distributions across multiple groups (e.g., comparing blood pressure across different treatment arms).
Bar Charts (Categorical Data)
Display counts or percentages in categories.
Simple, clear way to communicate categorical frequencies.
Pie Charts (Categorical Data)
Illustrate how a whole is divided among categories.
Less commonly recommended in scientific literature than bar charts, because the relative sizes of slices are harder to compare precisely.
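To make the boxplot description above concrete, the sketch below computes the numbers a boxplot is drawn from. It assumes the common convention of flagging points more than 1.5 × IQR beyond the quartiles as potential outliers; plotting libraries may use other quartile or whisker rules. The function name boxplot_summary and the blood pressure values are hypothetical.

```python
import statistics

def boxplot_summary(values):
    """Median, quartiles, and potential outliers under the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "outliers": outliers,
    }

# Hypothetical systolic blood pressure readings (mmHg) in one treatment arm
systolic_bp = [118, 122, 125, 128, 130, 132, 135, 138, 140, 142, 190]

result = boxplot_summary(systolic_bp)
print(result)
```

Running the same function on each treatment arm gives the values needed for a side-by-side comparison of groups.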
6. Clinical Relevance of Descriptive Statistics
Quality Control: Basic descriptive statistics are often the first step in identifying data-entry errors (e.g., an outlier that is clearly a mistyping).
Contextual Understanding: Clinically, understanding the distribution of age, gender, and comorbidities helps gauge whether a study population matches one’s own patients.
Hypothesis Formation: Observations about skewed distributions or unusual frequency counts can lead to new hypotheses or sub-analyses.
Example: In a study of systolic blood pressure among hypertensive patients, a histogram with a heavy right tail suggests a non-normal distribution. In that case, the median and IQR would usually be a better summary of central tendency and variability than the mean and SD.
7. Conclusion
Descriptive statistics form the backbone of any scientific investigation. By properly classifying your variables and choosing the correct measures of central tendency and spread, you can provide an accurate and meaningful summary of your dataset. Coupled with the right visualization techniques, descriptive statistics lay the foundation for all subsequent inferential analyses and evidence-based conclusions in clinical research.