Have you ever had to tackle exploratory data analysis across multiple files in a large dataset without knowing where to begin, how to write code in languages like Python, R, or SQL, or how to set up an environment for data analysis?
Here is a “Magic Wand” that allows you to complete these processes without any code. Simply use natural language prompts with advanced AI tools based on LLMs. It serves as a powerful shortcut for initial exploration.
With a well-phrased prompt, AI tools for data analysis return answers within moments, whereas the conventional process requires setting up an environment, importing the necessary libraries, and writing code.
This guide walks through essential EDA tasks, such as generating statistical summaries and visualizing data distributions, using intuitive prompts in AI tools.
1. What Is Exploratory Data Analysis (EDA)?
Exploratory data analysis (EDA) is a crucial process in any data analysis project as it allows one to become familiar with the data before delving into complex modelling or decision-making.
The objective of EDA is to understand the dataset and gain insights: computing summary statistics; visualizing data to identify patterns, trends, correlations, outliers, or missing values; and checking the assumptions required for further analysis.
Understanding the EDA process is vital because it helps prevent mistakes, informs feature engineering, guides model selection, and ultimately leads to more reliable insights. It serves as the foundation for data-driven decision-making.
2. Why Use Prompts for Basic EDA?
Prompts are an excellent starting point for basic EDA, although complex analysis still requires traditional methods and a deeper statistical understanding.
- Speed and Efficiency: Get immediate answers to basic questions without the need to write or debug code.
- Accessibility: Democratizes basic analysis without technical barriers. It empowers individuals who are not proficient coders but understand data questions.
- Intuitive Interaction: Leverages natural language.
- Focus on the "What," Not the "How": Encourages analysts to think critically about the questions they want to ask the data, rather than getting caught up in implementation details.
- Iterative Exploration: Easily adjust prompts and ask follow-up questions based on initial results. Facilitates a rapid exploration cycle.
3. Preparing Your Data and Tool
Prerequisites: Access to an AI tool that can analyze data from prompts, such as an AI notebook, a data analysis platform with a chat interface, or a code-generating assistant within an IDE.
Data Loading: Load the dataset files into the AI tool (this guide assumes the data has already been cleaned). The examples below use:
- Titanic Dataset from Kaggle
- AI tool: ChatGPT
Context is key: Ensure that the AI knows which dataset you’re referring to in subsequent prompts.
Example dataset: Start with a simple, well-known dataset for practice, such as Iris, Titanic, or a hypothetical ‘sales_data’ with columns like ‘Region’, ‘SalesAmount’, and ‘Age’.
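For comparison, here is roughly what the same loading step looks like in code. This is a minimal sketch with pandas: the six rows are made-up stand-ins in the shape of the Titanic columns, and the file name `train.csv` mentioned in the comment is an assumption about the Kaggle download.

```python
import io

import pandas as pd

# Made-up rows that mimic the Titanic columns, so the sketch runs anywhere.
# With the real Kaggle download this would be pd.read_csv("train.csv")
# (the file name is an assumption).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
4,1,1,female,35
5,0,3,male,
6,0,1,male,54
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)         # (rows, columns)
print(df.dtypes)        # column types as pandas inferred them
print(df.isna().sum())  # missing values per column
```

A quick look at the shape, types, and missing values is also a good first prompt to send to the AI tool, since it confirms that both of you are looking at the same data.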
4. Prompts for Statistical Summaries
📍 Types of Summaries & Prompts:
1️⃣ Overall numerical summary: AI will extract and present numerical values from the Titanic dataset.
Prompt Examples:
- "Describe the dataset."
- "Show basic statistics (mean, median, std dev, min, max, count) for all numeric fields."
- Tip: state the desired output format, such as a table or a list.
👨🏾💻 Prompt: “Provide the statistical summary of the columns of ‘Survived’, ‘Age’, and ‘Pclass’ in the dataset. Format it as a table.”
🤖 AI output:
The numerical summary table for ‘Survived’, ‘Age’, and ‘Pclass’ shows count, mean, standard deviation, and min/percentile/max values.
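A table like this most likely corresponds to a pandas `describe()` call, although what the AI tool actually runs behind the scenes is an assumption. The sketch below uses a handful of made-up rows rather than the real Titanic file.

```python
import pandas as pd

# Made-up rows for illustration only; not the real Titanic data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
})

# count, mean, std, min, quartiles, and max for the three columns
summary = df[["Survived", "Age", "Pclass"]].describe()
print(summary.round(2))
```

Note that the count for 'Age' is lower than for the other columns, because `describe()` leaves out missing values.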
2️⃣ Targeted Numerical Summary for Specific Columns:
Prompt Examples:
- "What are the mean and median 'SalesAmount'?"
- "Summarize the 'Age' column."
- "Calculate the standard deviation for 'ProductPrice'."
👩💻 Prompt: “What are the mean and median values of the ‘Age’ column in the dataset?”
🤖 AI output:
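If you want to cross-check the AI's answer yourself, the calculation is short in pandas. A minimal sketch on made-up ages (not the real Titanic values):

```python
import pandas as pd

# Made-up ages for illustration; None marks a missing value.
ages = pd.Series([22, 38, 26, 35, None, 54, 2], name="Age")

mean_age = ages.mean()      # missing values are skipped by default
median_age = ages.median()

print(f"mean = {mean_age}, median = {median_age}")
```

pandas ignores missing ages by default, so it is worth asking the AI how it handled them when its numbers differ from yours.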
3️⃣ Categorical Summary for Counts & Frequencies:
Prompt Examples:
- "Show the unique values and their counts for the 'Region' column."
- "What are the different categories in 'ProductType'?"
These prompts can request:
- A list of all unique categories.
- The number of occurrences for each category.
- A summary presented in a specific format, like a table or bullet points.
🧑🏻💻 Prompt: “Show the unique values and their counts for the ‘Sex’ column in the dataset.”
🤖 AI output:
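In code, the same question is usually answered with `value_counts()`. The sketch below assumes a 'Sex' column like the one in the Titanic dataset and uses made-up rows.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male", "male"],
})

counts = df["Sex"].value_counts()                # occurrences per category
shares = df["Sex"].value_counts(normalize=True)  # the same as proportions

print(counts)
print(shares.round(2))
```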
4️⃣ Grouped Summaries:
Prompt Examples:
- "Calculate the average 'SalesAmount' for each 'Region'."
- "Show the median 'Age' grouped by 'JobTitle'."
- The desired output format: a table or bullet points
👩💻 Prompt: Calculate the percentage of ‘Survived’ grouped by ‘Sex’
🤖 AI output:
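A grouped percentage like this can be sketched in pandas as a `groupby` followed by a mean. Because 'Survived' is coded as 0 or 1, its mean within each group is the survival rate. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column = share of 1s; multiply by 100 for a percentage.
survival_pct = df.groupby("Sex")["Survived"].mean().mul(100)
print(survival_pct)
```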
Counting categorical data is crucial because it shows how frequently each category appears in the dataset. That makes it possible to identify trends and distributions, compare groups, detect imbalanced data, and make informed modelling decisions.
5. Prompts for Data Visualization
Data visualization is a critical part of the exploratory data analysis (EDA) process because it helps uncover patterns, trends, and insights that might be hidden in raw data. It makes complex data understandable, exposes outliers and data quality issues, and supports decision-making. It also makes findings easier to communicate to an audience.
📍Common Visualizations & Prompts:
1. Histograms for Frequency Distribution:
Prompt Examples:
- "Plot a histogram of 'Age'."
- "Show the distribution of 'SalesAmount'."
- "Create a histogram for 'CustomerLifetimeValue' with 15 bins."
Prompt: Create a histogram of ‘Age’ with 25 bins.
AI output:
The histogram offers a granular view of age frequency, highlighting distinct age clusters and rare extreme values.
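For reference, a histogram like this can be drawn with matplotlib in a few lines. This is a sketch on made-up ages; the output file name `age_hist.png` is arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages for illustration; None marks a missing value.
ages = pd.Series([22, 38, 26, 35, None, 54, 2, 27, 14, 4, 58, 20], name="Age")

fig, ax = plt.subplots()
# bins=25 mirrors the "with 25 bins" part of the prompt
counts, bin_edges, _ = ax.hist(ages.dropna(), bins=25, edgecolor="black")
ax.set_xlabel("Age")
ax.set_ylabel("Number of passengers")
ax.set_title("Age distribution (25 bins)")
fig.savefig("age_hist.png")
```

The bin count matters: too few bins hide clusters, too many turn the plot into noise, so it is worth trying a couple of values in follow-up prompts.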
2. Bar Plots for Comparing Groups with Subcategories:
A bar plot with a hue is useful when you need to compare categories while also breaking them down by an additional categorical variable. The hue makes differences between subgroups easier to spot and improves readability.
Prompt Examples:
- Survival rate by passenger class in the Titanic dataset
Prompt: Create a bar plot to show the number of people who survived based on their ‘Pclass’ and ‘Sex’. X-axis is ‘Sex’, Y-axis is ‘Survived’, and Hue is ‘Pclass’.
AI output:
The bar plot displays the survival rate by passenger ‘Sex’, with separate bars for each ‘Pclass’.
Females: Survival is highest in 1st class (~97%), followed by 2nd (~92%) and 3rd (~50%).
Males: Survival is highest in 1st class (~37%), followed by 2nd (~16%) and 3rd (~14%).
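A grouped bar plot of this kind can be sketched with pandas and matplotlib: compute the mean of 'Survived' for every 'Sex' and 'Pclass' combination, then plot one bar per class. The rows are made up, and whether the AI tool uses pandas plotting or another library is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "Pclass": [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows = Sex (x-axis), columns = Pclass (hue), values = survival rate
rates = df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")

ax = rates.plot(kind="bar", rot=0)
ax.set_ylabel("Survival rate")
ax.legend(title="Pclass")
ax.figure.savefig("survival_by_sex_pclass.png")
```

Using `.mean()` gives survival rates; swapping it for `.sum()` gives the number of survivors instead. Saying which of the two you want in the prompt avoids ambiguity.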
3. Box Plots for Summary & Outliers:
Prompt Examples:
- "Create a box plot for 'TestScore'."
- Good for comparing groups: "Box plot of 'Salary' by 'Department'."
Prompt: Create a box plot of ‘Age’ grouped by ‘Survived’.
AI output:
The box plot visualizes the age distributions for passengers who survived versus those who did not:
Median (orange line): Both groups have a median age around the high 20s, with the survivors’ median slightly lower.
Interquartile range (IQR): Survivors show a narrower age spread between the 25th and 75th percentiles.
Whiskers & outliers: The oldest non-survivors are around 74, while the oldest survivor is about 80.
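The same comparison can be sketched with the pandas `boxplot` method, which draws one box of 'Age' per value of 'Survived'. The ages below are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [22, 35, 54, 74, 2, 26, 38, 80],
})

fig, ax = plt.subplots()
df.boxplot(column="Age", by="Survived", ax=ax)  # one box per Survived value
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
ax.set_title("Age by survival")
ax.set_ylabel("Age")
fig.savefig("age_by_survived.png")

# The numbers behind the plot, handy for checking what the boxes show
medians = df.groupby("Survived")["Age"].median()
print(medians)
```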
6. Crafting Effective Prompts
- Be Specific: Define the column names, the type of analysis desired, and the name of the dataset.
- Start Simple, Iterate: Ask broad questions first, then refine with more specific follow-ups.
- Use Action Verbs: "Calculate," "Plot," "Show," "Summarize," "Count," "Describe."
- Provide Context: When asking follow-up questions, refer back to the previous summary to maintain clarity and continuity.
- Specify Parameters: Clearly state any specific requirements, such as the number of bins for a histogram or the specific statistics needed (e.g., mean, median).
- Experiment: Don't be afraid to rephrase your questions if the AI doesn't understand initially.
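If you reuse the same kinds of prompts often, the tips above can be turned into a small template. The helper below is purely hypothetical (`build_prompt` is not part of any library); it only illustrates how an action verb, column names, the dataset name, parameters, and an output format fit into one specific prompt.

```python
def build_prompt(action, columns, dataset, output_format=None, **params):
    """Assemble a specific EDA prompt from its parts (hypothetical helper)."""
    cols = ", ".join(f"'{c}'" for c in columns)
    prompt = f"{action} {cols} in the {dataset} dataset"
    if params:
        # e.g. bins=25 becomes "bins = 25"
        details = ", ".join(
            f"{name.replace('_', ' ')} = {value}" for name, value in params.items()
        )
        prompt += f" ({details})"
    prompt += "."
    if output_format:
        prompt += f" Format the result as a {output_format}."
    return prompt


hist_prompt = build_prompt("Plot a histogram of", ["Age"], "Titanic", bins=25)
table_prompt = build_prompt(
    "Summarize", ["Survived", "Age"], "Titanic", output_format="table"
)
print(hist_prompt)
print(table_prompt)
```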
7. Limitations and Important Considerations
- Ambiguity: AI might misinterpret vague prompts.
- Complexity: May struggle with multi-step, highly complex analysis requests in a single prompt.
- "Black Box" Element: Understanding how the AI calculated something might be less transparent than code.
- Critical Thinking Required: The AI provides results, but you must interpret them correctly and critically. It doesn't replace statistical knowledge.
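One practical way to handle the "black box" concern is to spot-check a few of the AI's figures against your own calculation. The sketch below is one possible approach, not a standard procedure: `matches` is a hypothetical helper, the ages are made up, and the "reported" values stand in for numbers copied from an AI answer.

```python
import math

import pandas as pd


def matches(reported, computed, rel_tol=0.01):
    """True if the AI-reported value is within 1% of the computed one."""
    return math.isclose(reported, computed, rel_tol=rel_tol)


# Made-up ages for illustration; None marks a missing value.
ages = pd.Series([22, 38, 26, 35, None, 54, 2], name="Age")
computed_mean = ages.mean()

# Hypothetical figures standing in for numbers copied from an AI answer
print(matches(29.5, computed_mean))  # agrees with the recomputed mean
print(matches(35.0, computed_mean))  # disagrees, so worth a follow-up question
```

When a figure does not match, a follow-up prompt asking how missing values or filters were handled often explains the difference.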
Final Thoughts: Find Hidden Gems 💎✨ with Prompts
Prompt engineering provides a quick and easy way to conduct essential EDA tasks like generating statistical summaries for numerical values and visualizing distributions, patterns, and outliers.
By letting you put questions to the data directly through AI tools, it speeds up the EDA process and opens up new avenues of exploration that support decision-making.
Start exploring data by utilizing prompt techniques.
"Give these prompts a try on your dataset! What insights do you discover within the first few minutes?
Feel free to share your experiences or prompt techniques in the comments below."
🔔Subscribe for more insights!