In data analysis, testing, and simulation scenarios, generating large datasets with unique and realistic values is crucial. In this blog, we’ll explore a Python script leveraging the Faker library to create unique and comprehensive datasets. This guide will walk you through the code and explain how to execute it efficiently.
Prerequisites
To follow along, ensure you have:
- Python installed on your system (version 3.6 or later recommended).
- Faker library installed:
pip install faker
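A quick way to confirm the install (and the en_IN locale the script uses) is to generate a couple of throwaway values. This check is not part of the script itself:

from faker import Faker

fake = Faker('en_IN')  # Indian-English locale, same as the script below
print(fake.name(), '|', fake.city_name(), '|', fake.postcode())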
Key Features of the Script
This script is designed to:
- Generate unique values for store_id, product_id, and other fields (a short uniqueness sketch follows this list).
- Produce datasets with realistic dates, amounts, and other details.
- Handle large-scale data generation with configurable record limits.
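The uniqueness guarantee comes from Faker's unique proxy: any provider called through fake.unique remembers the values it has already returned and retries until it finds a fresh one, raising UniquenessException when it runs out of attempts. Here is a minimal sketch of the pattern the script relies on (the 1,000-value loop is only for illustration):

from faker import Faker
from faker.exceptions import UniquenessException

fake = Faker('en_IN')

try:
    # Each call returns a value not previously returned for this method/arguments.
    product_ids = [fake.unique.random_number(digits=6) for _ in range(1_000)]
    print(len(product_ids), len(set(product_ids)))  # 1000 1000 -> no duplicates
except UniquenessException:
    # Raised when the retry budget for unique values is exhausted.
    print("Out of unique values: widen digits= or call fake.unique.clear()")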
The Code
Here’s the complete Python script:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
from faker import Faker

# Set random seed for reproducibility
np.random.seed(150)
fake = Faker('en_IN')

# Increase UNIQUE_ATTEMPTS limit
_UNIQUE_ATTEMPTS = 50_000  # Maximum attempts to generate unique values


def generate_complete_blinkit_data(num_orders=10000, start_date='2022-01-01', end_date='2025-12-31'):
    # Convert string dates to datetime
    start_date = datetime.strptime(start_date, '%Y-%m-%d')
    end_date = datetime.strptime(end_date, '%Y-%m-%d')
    date_range_days = (end_date - start_date).days

    # 1. Customer Data
    customers = []
    for _ in range(num_orders // 2):  # Assuming each customer makes ~2 orders
        customer = {
            'customer_id': fake.unique.random_number(digits=8),
            'customer_name': fake.name(),
            'email': fake.email(),
            'phone': f"+91{fake.msisdn()[3:]}",
            'address': fake.address(),
            'area': fake.city_name(),
            'pincode': fake.postcode(),
            'registration_date': fake.date_between(start_date=start_date, end_date=end_date),
            'customer_segment': random.choice(['New', 'Regular', 'Premium', 'Inactive']),
            'total_orders': random.randint(1, 20),
            'avg_order_value': round(random.uniform(200, 2000), 2)
        }
        customers.append(customer)

    # 2. Product and Inventory Data
    categories = {
        'Fruits & Vegetables': {'margin': 0.25, 'shelf_life_days': 3},
        'Dairy & Breakfast': {'margin': 0.20, 'shelf_life_days': 7},
        'Snacks & Munchies': {'margin': 0.35, 'shelf_life_days': 90},
        'Cold Drinks & Juices': {'margin': 0.30, 'shelf_life_days': 180},
        'Instant & Frozen Food': {'margin': 0.40, 'shelf_life_days': 180},
        'Grocery & Staples': {'margin': 0.15, 'shelf_life_days': 365},
        'Household Care': {'margin': 0.25, 'shelf_life_days': 365},
        'Personal Care': {'margin': 0.35, 'shelf_life_days': 365},
        'Baby Care': {'margin': 0.30, 'shelf_life_days': 365},
        'Pet Care': {'margin': 0.35, 'shelf_life_days': 365},
        'Pharmacy': {'margin': 0.20, 'shelf_life_days': 365}
    }

    products = []
    inventory = []
    for cat, details in categories.items():
        num_products = random.randint(15, 30)
        for _ in range(num_products):
            price = round(random.uniform(10, 1000), 2)
            product_id = fake.unique.random_number(digits=6)
            product = {
                'product_id': product_id,
                'product_name': fake.word() + ' ' + fake.word(),
                'category': cat,
                'brand': fake.company(),
                'price': price,
                'mrp': round(price / (1 - details['margin']), 2),
                'margin_percentage': details['margin'] * 100,
                'shelf_life_days': details['shelf_life_days'],
                'min_stock_level': random.randint(10, 30),
                'max_stock_level': random.randint(50, 100)
            }
            products.append(product)

            # Generate inventory data for date range
            for day in range(date_range_days + 1):
                date = start_date + timedelta(days=day)
                inventory.append({
                    'product_id': product_id,
                    'date': date.date(),
                    'stock_received': random.randint(4, 20),
                    'damaged_stock': random.randint(0, 3),
                })

    # 3. Order Data
    orders = []
    order_items = []
    for _ in range(num_orders):
        customer = random.choice(customers)
        order_date = fake.date_time_between(start_date=start_date, end_date=end_date)
        promised_delivery = order_date + timedelta(minutes=random.randint(10, 20))

        delivery_scenario = random.random()
        if delivery_scenario < 0.7:
            actual_delivery = promised_delivery + timedelta(minutes=random.randint(-5, 5))
            delivery_status = 'On Time'
        elif delivery_scenario < 0.9:
            actual_delivery = promised_delivery + timedelta(minutes=random.randint(6, 15))
            delivery_status = 'Slightly Delayed'
        else:
            actual_delivery = promised_delivery + timedelta(minutes=random.randint(16, 30))
            delivery_status = 'Significantly Delayed'

        order_id = fake.unique.random_number(digits=10)
        num_items = random.randint(1, 8)
        order_products = random.sample(products, num_items)
        order_total = sum(p['price'] for p in order_products)

        order = {
            'order_id': order_id,
            'customer_id': customer['customer_id'],
            'order_date': order_date,
            'promised_delivery_time': promised_delivery,
            'actual_delivery_time': actual_delivery,
            'delivery_status': delivery_status,
            'order_total': round(order_total, 2),
            'payment_method': random.choice(['UPI', 'Card', 'Cash', 'Wallet']),
            'delivery_partner_id': fake.unique.random_number(digits=5),
            'store_id': fake.unique.random_number(digits=10)
        }
        orders.append(order)

        # Generate order items
        for product in order_products:
            quantity = random.randint(1, 3)
            order_items.append({
                'order_id': order_id,
                'product_id': product['product_id'],
                'quantity': quantity,
                'unit_price': product['price'],
                'total_price': product['price'] * quantity
            })

    # 4. Delivery Performance Data
    delivery_performance = []
    for order in orders:
        delivery_time = (order['actual_delivery_time'] - order['promised_delivery_time']).total_seconds() / 60
        delivery_performance.append({
            'order_id': order['order_id'],
            'delivery_partner_id': order['delivery_partner_id'],
            'promised_time': order['promised_delivery_time'],
            'actual_time': order['actual_delivery_time'],
            'delivery_time_minutes': round(delivery_time, 2),
            'distance_km': round(random.uniform(0.5, 5), 2),
            'delivery_status': order['delivery_status'],
            'reasons_if_delayed': 'Traffic' if delivery_time > 0 else None
        })

    # 5. Customer Feedback Data
    feedback = []
    for order in orders:
        if order['delivery_status'] == 'On Time':
            rating = random.randint(4, 5)
            sentiment = 'Positive'
        elif order['delivery_status'] == 'Slightly Delayed':
            rating = random.randint(3, 4)
            sentiment = 'Neutral'
        else:
            rating = random.randint(1, 3)
            sentiment = 'Negative'

        feedback.append({
            'feedback_id': fake.unique.random_number(digits=7),
            'order_id': order['order_id'],
            'customer_id': order['customer_id'],
            'rating': rating,
            'feedback_text': fake.text(max_nb_chars=100),
            'feedback_category': random.choice(['Delivery', 'Product Quality', 'App Experience', 'Customer Service']),
            'sentiment': sentiment,
            'feedback_date': order['actual_delivery_time'] + timedelta(minutes=random.randint(10, 60))
        })

    # 6. Marketing Performance Data
    marketing_campaigns = [
        'New User Discount', 'Weekend Special', 'Festival Offer', 'Flash Sale',
        'Membership Drive', 'Category Promotion', 'App Push Notification',
        'Email Campaign', 'Referral Program'
    ]

    marketing_data = []
    for day in range(date_range_days + 1):
        date = start_date + timedelta(days=day)
        for campaign in marketing_campaigns:
            marketing_data.append({
                'campaign_id': fake.unique.random_number(digits=6),
                'campaign_name': campaign,
                'date': date.date(),
                'target_audience': random.choice(['All', 'New Users', 'Premium', 'Inactive']),
                'channel': random.choice(['App', 'Email', 'SMS', 'Social Media']),
                'impressions': random.randint(400, 1000),
                'clicks': random.randint(50, 300),
                'conversions': random.randint(10, 100),
                'spend': round(random.uniform(50, 100), 2),
                'revenue_generated': round(random.uniform(100, 500), 2),
                'roas': round(random.uniform(1.5, 4.0), 2)
            })

    # Convert to DataFrames
    return {
        'customers': pd.DataFrame(customers),
        'products': pd.DataFrame(products),
        'inventory': pd.DataFrame(inventory),
        'orders': pd.DataFrame(orders),
        'order_items': pd.DataFrame(order_items),
        'delivery_performance': pd.DataFrame(delivery_performance),
        'customer_feedback': pd.DataFrame(feedback),
        'marketing_performance': pd.DataFrame(marketing_data)
    }


def save_blinkit_data(data_dict, prefix='blinkit_'):
    """Save all generated DataFrames to CSV files"""
    for name, df in data_dict.items():
        df.to_csv(f'{prefix}{name}.csv', index=False)


# Generate and save the data with custom date range
data = generate_complete_blinkit_data(start_date='2023-01-01', end_date='2025-12-31')
save_blinkit_data(data)
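Since generate_complete_blinkit_data() returns a dictionary of pandas DataFrames keyed by table name, it is easy to sanity-check what was produced before (or after) saving. This small snippet assumes it runs after the script above, so data already exists:

# Quick summary of every generated table.
for name, df in data.items():
    print(f"{name}: {len(df):,} rows x {df.shape[1]} columns")

# Peek at the first few orders.
print(data['orders'].head())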
Steps to Run the Script
Step 1: Save the Code
Copy the code into a file named DataCreator.py.
Step 2: Run the Script
Execute the script using the terminal or command prompt:
python DataCreator.py
Step 3: Check the Output
- By default the script generates 10,000 orders (change this by passing a different num_orders value to generate_complete_blinkit_data), plus the related customer, product, inventory, delivery, feedback, and marketing records.
- Eight CSV files are written to the working directory (a quick verification snippet follows):
blinkit_customers.csv, blinkit_products.csv, blinkit_inventory.csv, blinkit_orders.csv, blinkit_order_items.csv, blinkit_delivery_performance.csv, blinkit_customer_feedback.csv, blinkit_marketing_performance.csv
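To verify the run, you can list the generated files and their row counts from the same directory. This is a small optional check, not part of DataCreator.py:

import glob
import pandas as pd

for path in sorted(glob.glob('blinkit_*.csv')):
    rows = len(pd.read_csv(path))
    print(f"{path}: {rows:,} rows")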
Customization Options
1. Change Record Count: Pass a different num_orders value to control how many orders (and dependent records) are generated.
data = generate_complete_blinkit_data(num_orders=100_000)  # For 100,000 orders
2. Adjust Date Range: Update start_date and end_date in the function call:
data = generate_complete_blinkit_data(start_date='2022-01-01', end_date='2023-12-31', num_orders=50_000)
3. Field Ranges: Update the digits= values passed to fake.unique.random_number() for store_id, product_id, or customer_id based on your requirements (a combined example follows this list).
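Putting these options together, and assuming the functions from DataCreator.py are already defined (or imported), a customized run might look like the sketch below. The numbers, dates, and prefix are examples only:

data = generate_complete_blinkit_data(
    num_orders=100_000,              # every order-level table scales with this
    start_date='2022-01-01',
    end_date='2023-12-31',
)
# Note: at this scale the 5-digit delivery_partner_id space (at most 100,000
# values) is likely to run dry; widen its digits= value first (see Troubleshooting).
save_blinkit_data(data, prefix='blinkit_large_')  # writes blinkit_large_orders.csv, etc.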
Troubleshooting Tips
- UniquenessException: If you encounter this error, increase _UNIQUE_ATTEMPTS or widen the digits= range for unique fields like store_id or product_id.
- Memory Errors: For extremely large datasets, consider generating data in chunks or using an external database (a chunked sketch follows this list).
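For the memory point, one possible approach is to generate smaller batches and append them to the CSVs instead of holding everything in memory. This is a sketch, not part of the original script: the generate_in_chunks helper is hypothetical and assumes the functions above are already defined (for example, appended to DataCreator.py in place of the one-shot call at the end):

import os

def generate_in_chunks(total_orders=50_000, chunk_size=10_000,
                       start_date='2023-01-01', end_date='2025-12-31',
                       prefix='blinkit_'):
    """Generate orders in batches and append each batch to the output CSVs."""
    for _ in range(0, total_orders, chunk_size):
        batch = generate_complete_blinkit_data(num_orders=chunk_size,
                                               start_date=start_date,
                                               end_date=end_date)
        for name, df in batch.items():
            path = f'{prefix}{name}.csv'
            # Write the header only when the file does not exist yet.
            df.to_csv(path, mode='a', index=False, header=not os.path.exists(path))
        # fake.unique persists across calls, so IDs stay unique between batches.
    # Caveat: reference tables (products, inventory, marketing) are regenerated
    # for every batch; in practice you may want to build those once and only
    # chunk the order-level tables.

generate_in_chunks()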
Use Cases
This script can be used for:
- Testing database systems with large-scale data (a SQLite loading sketch follows this list).
- Simulating e-commerce datasets for analytics.
- Learning data processing and visualization.
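As a concrete example of the first use case, the generated CSVs can be pushed into a local SQLite database for query testing. A minimal sketch using pandas' to_sql; the blinkit.db file name is just an example:

import glob
import sqlite3
import pandas as pd

conn = sqlite3.connect('blinkit.db')
for path in sorted(glob.glob('blinkit_*.csv')):
    table = path[len('blinkit_'):-len('.csv')]   # e.g. blinkit_orders.csv -> orders
    pd.read_csv(path).to_sql(table, conn, if_exists='replace', index=False)
    print(f"loaded {path} into table {table}")
conn.close()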
Conclusion
This Python script, powered by the Faker library, is a versatile tool for generating realistic datasets. With features like unique value generation and customizable ranges, it is perfect for data professionals and enthusiasts alike.
Try it out and supercharge your data testing workflows! 🚀