How to Do Cholesterol Distribution Analysis with Jupyter Notebook

Cholesterol is an important part of health studies. High cholesterol increases the risk of heart disease. In this post, we will use Python in a Jupyter Notebook to model the total cholesterol distribution of US males aged 40–60 years, adjusted to match a summary figure from the NHANES 2021–2023 survey.

We will go step by step: import libraries, create data, adjust it to match real survey results, save as CSV, visualize, and calculate statistics.

Step 1: Import Libraries

We need a few Python libraries to do calculations, handle data, and create graphs.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

Explanation:

numpy → works with numbers and arrays.
pandas → works with tables and data files.
matplotlib → makes charts and graphs.
scipy.stats → helps with statistics and probability.

Step 2: Create Cholesterol Data

We take the average total cholesterol in this age group to be about 200 mg/dL, with a spread (standard deviation) of 42 mg/dL. Let’s build a normal distribution from these two numbers.

mean_cholesterol = 200     # average value
std_cholesterol = 42       # spread of values

# cholesterol values from 50 to 400 in steps of 5
cholesterol_levels = np.arange(50, 405, 5)

# normal distribution with given mean and spread
cholesterol_dist = stats.norm(mean_cholesterol, std_cholesterol)

# probability density for each level
pdf_values = cholesterol_dist.pdf(cholesterol_levels)

# density x bin width (5 mg/dL) approximates the share per bin; x 100 gives percent
population_percentages = pdf_values * 5 * 100

Explanation:
We create cholesterol values between 50 and 400 mg/dL in steps of 5. The normal distribution gives a probability density at each value, not a probability. Multiplying the density by the bin width (5 mg/dL) approximates the share of people in that bin, and multiplying by 100 turns it into a percentage.
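To see how good the "density times bin width" shortcut is, the sketch below compares it with the share obtained from the cumulative distribution, treating each level as the centre of a 5 mg/dL bin. It uses the standard library's statistics.NormalDist instead of scipy.stats so it runs without extra packages; that swap, and the names approx, exact, and max_gap, are choices made for this illustration only.

```python
from statistics import NormalDist

# Same mean and spread as in Step 2
dist = NormalDist(mu=200, sigma=42)
bin_width = 5
levels = list(range(50, 405, bin_width))

# Shortcut used in Step 2: density at the bin centre times the bin width
approx = [dist.pdf(x) * bin_width * 100 for x in levels]

# Reference: probability mass between the bin edges, from the CDF
exact = [
    (dist.cdf(x + bin_width / 2) - dist.cdf(x - bin_width / 2)) * 100
    for x in levels
]

max_gap = max(abs(a - e) for a, e in zip(approx, exact))
print(f"Sum of shortcut shares:  {sum(approx):.2f}%")
print(f"Sum of CDF-based shares: {sum(exact):.2f}%")
print(f"Largest per-bin difference: {max_gap:.4f} percentage points")
```

With bins this narrow relative to the spread, the two versions should agree closely, which is why the simpler shortcut is reasonable here.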

Step 3: Adjust Data to Match NHANES

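NHANES 2021–2023 reports that 16.7% of adults aged 40–59 have total cholesterol ≥ 240 mg/dL. This figure is for all adults aged 40–59 rather than specifically for males aged 40–60, so we use it as an approximate target. Our model may not match it, so we adjust the model.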

# percentage in our model with ≥240 mg/dL
current_high = np.sum(population_percentages[cholesterol_levels >= 240])

# target value from NHANES
target_high = 16.7

# correction factor
correction_factor = target_high / current_high

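# apply correction to values at or above 200 mg/dL
# the weight ramps from 0 at 200 to 1 at 300 and above, so higher levels are scaled more
for i, level in enumerate(cholesterol_levels):
    if level >= 200:
        weight = min(1.0, (level - 200) / 100)
        population_percentages[i] *= (1 + weight * (correction_factor - 1))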

# normalize so total = 100%
population_percentages = (population_percentages / np.sum(population_percentages)) * 100

Explanation:
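We check what share of people in our model have cholesterol ≥240. If it is not 16.7%, we compute a correction factor (target divided by current share) and apply it gradually: no change below 200 mg/dL, partial scaling between 200 and 300, and full scaling at 300 and above. Finally, we normalize so that the total equals 100%. Because the scaling is only partial near 240 and the totals are then renormalized, the final share at or above 240 moves toward 16.7% but will not necessarily equal it, so it is worth recomputing the share after this step.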
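The adjustment can be rerun in isolation to see how close it gets to the target. The sketch below is a vectorised restatement of the loop above under the same assumptions (mean 200, spread 42, 5 mg/dL steps). The normal density is written out by hand so only NumPy is needed, and the names shares, weights, adjusted, before, and after are introduced here for illustration.

```python
import numpy as np

mean, sd, step = 200, 42, 5
levels = np.arange(50, 405, step)

# Normal density written out explicitly, then converted to percent per bin
z = (levels - mean) / sd
shares = np.exp(-0.5 * z**2) / (sd * np.sqrt(2 * np.pi)) * step * 100

target_high = 16.7
before = shares[levels >= 240].sum()
factor = target_high / before

# Same ramp as the loop: weight 0 below 200, rising to 1 at 300 and above
weights = np.clip((levels - 200) / 100, 0.0, 1.0)
adjusted = shares * (1 + weights * (factor - 1))
adjusted = adjusted / adjusted.sum() * 100  # renormalize to 100%

after = adjusted[levels >= 240].sum()
print(f"Share >= 240 before adjustment: {before:.2f}%")
print(f"Share >= 240 after adjustment:  {after:.2f}%")
print(f"Gap to target: {after - target_high:+.2f} percentage points")
```

If an exact match to the target is required, one option is to repeat the adjustment until the gap is small enough; the single pass shown in the post is a simpler approximation.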

Step 4: Save Data as CSV

We save the cholesterol distribution into a CSV file.

df = pd.DataFrame({
    "cholesterol_level": cholesterol_levels,
    "population_perc": population_percentages
})

df.to_csv("cholesterol_distribution.csv", index=False)

Explanation:
We create a pandas DataFrame with two columns: cholesterol level and percentage. Then we save it as a CSV file. This file can be opened in Excel or used in other studies.

Step 5: Visualize the Distribution

We now make a chart to see the cholesterol distribution.

plt.figure(figsize=(12, 8))
plt.plot(cholesterol_levels, population_percentages, 'b-', linewidth=2, label="Cholesterol Distribution")
plt.fill_between(cholesterol_levels, population_percentages, alpha=0.3, color="lightblue")

# highlight 184 mg/dL
# 184 is not on the 5 mg/dL grid, so we read the percentage at the nearest grid point (185)
cholesterol_184_idx = np.argmin(np.abs(cholesterol_levels - 184))
cholesterol_184_perc = population_percentages[cholesterol_184_idx]

plt.axvline(x=184, color="red", linestyle="--", linewidth=2)
plt.annotate(f"184 mg/dL\n({cholesterol_184_perc:.2f}%)",
             xy=(184, cholesterol_184_perc),
             xytext=(184, cholesterol_184_perc + 0.5),
             arrowprops=dict(arrowstyle="->", color="red", lw=2),
             fontsize=12, color="red", ha="center")

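# category lines (labelled so they appear in the legend)
plt.axvline(x=200, color="orange", linestyle=":", linewidth=2, label="200 mg/dL (borderline starts)")
plt.axvline(x=240, color="darkred", linestyle=":", linewidth=2, label="240 mg/dL (high starts)")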

plt.xlabel("Total Cholesterol (mg/dL)")
plt.ylabel("Percentage of People (%)")
plt.title("Cholesterol Distribution in US Males (40–60 years) - NHANES 2021–2023")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Explanation:
The blue curve shows the cholesterol distribution.
A red dashed line shows 184 mg/dL.
An orange dotted line marks 200 mg/dL (border between desirable and borderline).
A dark red dotted line marks 240 mg/dL (border between borderline and high).

Step 6: Calculate Summary Statistics

We calculate key numbers from the dataset.

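mean_calc = np.average(cholesterol_levels, weights=population_percentages)
high_chol = np.sum(population_percentages[cholesterol_levels >= 240])
borderline_chol = np.sum(population_percentages[(cholesterol_levels >= 200) & (cholesterol_levels < 240)])
desirable_chol = np.sum(population_percentages[cholesterol_levels < 200])
print(f"Mean cholesterol: {mean_calc:.1f} mg/dL")
print(f"Desirable (<200 mg/dL): {desirable_chol:.1f}%")
print(f"Borderline (200-239 mg/dL): {borderline_chol:.1f}%")
print(f"High (>=240 mg/dL): {high_chol:.1f}%")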

Explanation:

mean_calc gives the average cholesterol level.
high_chol gives the percentage with cholesterol ≥240 mg/dL.
borderline_chol gives the percentage between 200 and 239 mg/dL.
desirable_chol gives the percentage below 200 mg/dL.
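
A quick way to sanity-check these four numbers is to wrap the same calculations in a small helper and run it on an input simple enough to verify by hand. The function name summarize and the toy values below are made up for this illustration and are not NHANES data.

```python
import numpy as np

def summarize(levels, shares):
    """Return the weighted mean and category shares (in %) for a binned distribution."""
    levels = np.asarray(levels, dtype=float)
    shares = np.asarray(shares, dtype=float)
    return {
        "mean": float(np.average(levels, weights=shares)),
        "desirable": float(shares[levels < 200].sum()),
        "borderline": float(shares[(levels >= 200) & (levels < 240)].sum()),
        "high": float(shares[levels >= 240].sum()),
    }

# Toy input: five levels whose shares add up to 100%
toy_levels = [180, 200, 220, 240, 260]
toy_shares = [40.0, 25.0, 15.0, 12.0, 8.0]

summary = summarize(toy_levels, toy_shares)
for name, value in summary.items():
    print(f"{name:>10}: {value:.1f}")

# The three categories cover every level once, so they should add up to the total
category_total = summary["desirable"] + summary["borderline"] + summary["high"]
print(f"Category total: {category_total:.1f}%")
```

The same helper can be called with cholesterol_levels and population_percentages from the notebook; if the category total is not 100%, the normalization in Step 3 was probably skipped.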

Step 7: Verify CSV File Structure

After saving, we should check if the CSV file is correct. Let’s reload the file and verify its structure, columns, and values.

# Verify CSV file structure
df_verify = pd.read_csv('cholesterol_distribution.csv')

print("📁 CSV FILE VERIFICATION")
print("="*50)
print("File name: cholesterol_distribution.csv")
print(f"Columns: {list(df_verify.columns)}")
print(f"Shape: {df_verify.shape}")
print(f"Data range: {df_verify['cholesterol_level'].min()}-{df_verify['cholesterol_level'].max()} mg/dL")
print(f"Total percentage: {df_verify['population_perc'].sum():.2f}%")
print("\nSample data:")
print(df_verify.head())
print("\n✅ CSV file ready for use!")

Explanation:
We reload the CSV file and print some details:

File name → confirms the file we saved.
Columns → shows the two columns (cholesterol_level and population_perc).
Shape → tells the number of rows and columns.
Data range → minimum and maximum cholesterol levels in the file.
Total percentage → ensures that the percentages add up to 100%.
Sample data → first 5 rows of the CSV to quickly check values.

This verification step makes sure our dataset is clean, structured, and ready for analysis or sharing.

Step 8: Percentiles

Percentiles show where a value stands compared to the rest of the population.

# Calculate cumulative percentages for percentile analysis
cumulative_perc = np.cumsum(population_percentages)

# Find key percentiles
def find_percentile(target_percentile):
    idx = np.argmin(np.abs(cumulative_perc - target_percentile))
    return cholesterol_levels[idx]

percentiles = [5, 10, 25, 50, 75, 90, 95]
chol_percentiles = [find_percentile(p) for p in percentiles]

print("📊 CHOLESTEROL PERCENTILES")
print("="*50)
for p, chol in zip(percentiles, chol_percentiles):
    print(f"{p:2d}th percentile: {chol:3d} mg/dL")

# Where does 184 fall? It is not on the 5 mg/dL grid, so we use the nearest grid point (185)
idx_184 = np.argmin(np.abs(cholesterol_levels - 184))
perc_184 = cumulative_perc[idx_184]
print(f"\n🎯 184 mg/dL is at approximately the {perc_184:.1f}th percentile")
print(f"   (meaning about {perc_184:.1f}% of males 40-60 have cholesterol ≤ 184 mg/dL)")

Explanation:
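We create a cumulative sum of percentages and, for each percentile, pick the grid level whose cumulative share is closest to it. Because the grid moves in steps of 5 mg/dL, the results are rounded to the nearest grid level. For example, if 184 mg/dL is at the 40th percentile, it means about 40% of men in this group have cholesterol at or below 184.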
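If values between grid points are wanted, linear interpolation on the cumulative curve is one option. The sketch below uses np.interp for that; the helper name percentile_level and the toy numbers are assumptions made for illustration. It follows the same convention as the code above, in which the cumulative share at a level counts that level's whole bin.

```python
import numpy as np

def percentile_level(levels, shares, target):
    """Interpolate the level at which the cumulative share (in %) reaches `target`."""
    cumulative = np.cumsum(shares)  # non-decreasing as long as shares are non-negative
    return float(np.interp(target, cumulative, levels))

# Toy input (made-up numbers): cumulative shares are 40, 65, 80, 92, 100
toy_levels = [180, 200, 220, 240, 260]
toy_shares = [40.0, 25.0, 15.0, 12.0, 8.0]

median_level = percentile_level(toy_levels, toy_shares, 50)
p80_level = percentile_level(toy_levels, toy_shares, 80)
print(f"50th percentile: {median_level:.1f} mg/dL")
print(f"80th percentile: {p80_level:.1f} mg/dL")
```

Note that np.interp clamps to the first or last level when the target falls outside the cumulative range, so very low percentiles on a coarse grid should be read with care.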

Conclusion

This analysis gives a model-based view of the total cholesterol distribution in US males aged 40–60: a normal distribution adjusted to match the NHANES 2021–2023 figure for high cholesterol.

Key Findings:

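The distribution shows that a cholesterol level of 184 mg/dL falls within the "desirable" range (below 200 mg/dL).
The model is adjusted toward the NHANES figure of 16.7% of adults aged 40–59 with high cholesterol (≥240 mg/dL). The share the model actually produces is the high_chol value from Step 6.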
The CSV file contains the complete distribution data for further analysis.

Files Created:

cholesterol_distribution.csv – Complete distribution data with columns: cholesterol_level, population_perc

This Jupyter Notebook teaches how to:

Build cholesterol data with average and spread.
Adjust the model to match NHANES survey results.
Save data into a CSV file.
Make a clear chart with category lines.
Calculate average, group percentages, and percentiles.

This step-by-step method is helpful for anyone doing cholesterol analysis or writing about health data.

SDE Techie