How to Generate Unique and Large-Scale Data Using Python and Faker

In data analysis, testing, and simulation work, large datasets with unique, realistic values are often essential. In this blog, we’ll walk through a Python script that uses the Faker library to generate a set of related tables for a fictional Blinkit-style grocery delivery business: customers, products, inventory, orders, order items, delivery performance, customer feedback, and marketing performance. This guide explains the code and shows how to run and customize it.

Prerequisites

To follow along, ensure you have:

  1. Python installed on your system (version 3.6 or later recommended).
  2. The Faker, pandas, and NumPy libraries installed (the script imports all three):
    pip install faker pandas numpy
    

Key Features of the Script

This script is designed to:

  1. Generate unique values for customer_id, order_id, product_id, store_id, and other ID fields.
  2. Produce datasets with realistic dates, amounts, and other details.
  3. Handle large-scale data generation with a configurable order count and date range.

The Code

Here’s the complete Python script:

      import pandas as pd
      import numpy as np
      from datetime import datetime, timedelta
      import random
      from faker import Faker
      
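      # Set random seeds for reproducibility.
      # The script draws from the random module and from Faker, so seed both (plus NumPy).
      random.seed(150)
      np.random.seed(150)
      Faker.seed(150)
      fake = Faker('en_IN')
      
      # Raise the retry limit used by fake.unique (Faker's default is 1,000).
      # Caution: this overrides a private constant inside Faker and may break in future versions.
      import faker.proxy
      faker.proxy._UNIQUE_ATTEMPTS = 50_000  # Maximum attempts to generate unique values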
      
      def generate_complete_blinkit_data(num_orders=10000, start_date='2022-01-01', end_date='2025-12-31'):
          # Convert string dates to datetime
          start_date = datetime.strptime(start_date, '%Y-%m-%d')
          end_date = datetime.strptime(end_date, '%Y-%m-%d')
          date_range_days = (end_date - start_date).days
      
          # 1. Customer Data
          customers = []
          for _ in range(num_orders // 2):  # Assuming each customer makes ~2 orders
              customer = {
                  'customer_id': fake.unique.random_number(digits=8),
                  'customer_name': fake.name(),
                  'email': fake.email(),
                  'phone': f"+91{fake.msisdn()[3:]}",
                  'address': fake.address(),
                  'area': fake.city_name(),
                  'pincode': fake.postcode(),
                  'registration_date': fake.date_between(start_date=start_date, end_date=end_date),
                  'customer_segment': random.choice(['New', 'Regular', 'Premium', 'Inactive']),
                  'total_orders': random.randint(1, 20),
                  'avg_order_value': round(random.uniform(200, 2000), 2)
              }
              customers.append(customer)
      
          # 2. Product and Inventory Data
          categories = {
              'Fruits & Vegetables': {'margin': 0.25, 'shelf_life_days': 3},
              'Dairy & Breakfast': {'margin': 0.20, 'shelf_life_days': 7},
              'Snacks & Munchies': {'margin': 0.35, 'shelf_life_days': 90},
              'Cold Drinks & Juices': {'margin': 0.30, 'shelf_life_days': 180},
              'Instant & Frozen Food': {'margin': 0.40, 'shelf_life_days': 180},
              'Grocery & Staples': {'margin': 0.15, 'shelf_life_days': 365},
              'Household Care': {'margin': 0.25, 'shelf_life_days': 365},
              'Personal Care': {'margin': 0.35, 'shelf_life_days': 365},
              'Baby Care': {'margin': 0.30, 'shelf_life_days': 365},
              'Pet Care': {'margin': 0.35, 'shelf_life_days': 365},
              'Pharmacy': {'margin': 0.20, 'shelf_life_days': 365}
          }
          
          products = []
          inventory = []
          
          for cat, details in categories.items():
              num_products = random.randint(15, 30)
              for _ in range(num_products):
                  price = round(random.uniform(10, 1000), 2)
                  product_id = fake.unique.random_number(digits=6)
                  
                  product = {
                      'product_id': product_id,
                      'product_name': fake.word() + ' ' + fake.word(),
                      'category': cat,
                      'brand': fake.company(),
                      'price': price,
                      'mrp': round(price / (1 - details['margin']), 2),
                      'margin_percentage': details['margin'] * 100,
                      'shelf_life_days': details['shelf_life_days'],
                      'min_stock_level': random.randint(10, 30),
                      'max_stock_level': random.randint(50, 100)
                  }
                  products.append(product)
                  
                  # Generate inventory data for date range
                  for day in range(date_range_days + 1):
                      date = start_date + timedelta(days=day)
                      inventory.append({
                          'product_id': product_id,
                          'date': date.date(),
                          'stock_received': random.randint(4, 20),
                          'damaged_stock': random.randint(0, 3),
                      })
      
          # 3. Order Data
          orders = []
          order_items = []
          
          for _ in range(num_orders):
              customer = random.choice(customers)
              order_date = fake.date_time_between(start_date=start_date, end_date=end_date)
              promised_delivery = order_date + timedelta(minutes=random.randint(10, 20))
              
              delivery_scenario = random.random()
              if delivery_scenario < 0.7:
                  actual_delivery = promised_delivery + timedelta(minutes=random.randint(-5, 5))
                  delivery_status = 'On Time'
              elif delivery_scenario < 0.9:
                  actual_delivery = promised_delivery + timedelta(minutes=random.randint(6, 15))
                  delivery_status = 'Slightly Delayed'
              else:
                  actual_delivery = promised_delivery + timedelta(minutes=random.randint(16, 30))
                  delivery_status = 'Significantly Delayed'
                  
              order_id = fake.unique.random_number(digits=10)
              num_items = random.randint(1, 8)
              order_products = random.sample(products, num_items)
              
              # Build the order items first so that order_total matches them.
              # The quantity is drawn once per item and reused for total_price.
              items = []
              for product in order_products:
                  quantity = random.randint(1, 3)
                  items.append({
                      'order_id': order_id,
                      'product_id': product['product_id'],
                      'quantity': quantity,
                      'unit_price': product['price'],
                      'total_price': round(product['price'] * quantity, 2)
                  })
              order_items.extend(items)
              order_total = sum(item['total_price'] for item in items)
              
              order = {
                  'order_id': order_id,
                  'customer_id': customer['customer_id'],
                  'order_date': order_date,
                  'promised_delivery_time': promised_delivery,
                  'actual_delivery_time': actual_delivery,
                  'delivery_status': delivery_status,
                  'order_total': round(order_total, 2),
                  'payment_method': random.choice(['UPI', 'Card', 'Cash', 'Wallet']),
                  'delivery_partner_id': fake.unique.random_number(digits=5),
                  'store_id': fake.unique.random_number(digits=10)
              }
              orders.append(order)
      
          # 4. Delivery Performance Data
          delivery_performance = []
          for order in orders:
              delivery_time = (order['actual_delivery_time'] - order['promised_delivery_time']).total_seconds() / 60
              delivery_performance.append({
                  'order_id': order['order_id'],
                  'delivery_partner_id': order['delivery_partner_id'],
                  'promised_time': order['promised_delivery_time'],
                  'actual_time': order['actual_delivery_time'],
                  'delivery_time_minutes': round(delivery_time, 2),
                  'distance_km': round(random.uniform(0.5, 5), 2),
                  'delivery_status': order['delivery_status'],
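                  # Only delayed orders get a reason; an on-time order can still arrive a few minutes late
                  'reasons_if_delayed': 'Traffic' if order['delivery_status'] != 'On Time' else None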
              })
      
          # 5. Customer Feedback Data
          feedback = []
          for order in orders:
              if order['delivery_status'] == 'On Time':
                  rating = random.randint(4, 5)
                  sentiment = 'Positive'
              elif order['delivery_status'] == 'Slightly Delayed':
                  rating = random.randint(3, 4)
                  sentiment = 'Neutral'
              else:
                  rating = random.randint(1, 3)
                  sentiment = 'Negative'
                  
              feedback.append({
                  'feedback_id': fake.unique.random_number(digits=7),
                  'order_id': order['order_id'],
                  'customer_id': order['customer_id'],
                  'rating': rating,
                  'feedback_text': fake.text(max_nb_chars=100),
                  'feedback_category': random.choice(['Delivery', 'Product Quality', 'App Experience', 'Customer Service']),
                  'sentiment': sentiment,
                  'feedback_date': order['actual_delivery_time'] + timedelta(minutes=random.randint(10, 60))
              })
      
          # 6. Marketing Performance Data
          marketing_campaigns = [
              'New User Discount',
              'Weekend Special',
              'Festival Offer',
              'Flash Sale',
              'Membership Drive',
              'Category Promotion',
              'App Push Notification',
              'Email Campaign',
              'Referral Program'
          ]
          
          marketing_data = []
          for day in range(date_range_days + 1):
              date = start_date + timedelta(days=day)
              for campaign in marketing_campaigns:
                  marketing_data.append({
                      'campaign_id': fake.unique.random_number(digits=6),
                      'campaign_name': campaign,
                      'date': date.date(),
                      'target_audience': random.choice(['All', 'New Users', 'Premium', 'Inactive']),
                      'channel': random.choice(['App', 'Email', 'SMS', 'Social Media']),
                      'impressions': random.randint(400, 1000),
                      'clicks': random.randint(50, 300),
                      'conversions': random.randint(10, 100),
                      'spend': round(random.uniform(50, 100), 2),
                      'revenue_generated': round(random.uniform(100, 500), 2),
                      'roas': round(random.uniform(1.5, 4.0), 2)
                  })
      
          # Convert to DataFrames
          return {
              'customers': pd.DataFrame(customers),
              'products': pd.DataFrame(products),
              'inventory': pd.DataFrame(inventory),
              'orders': pd.DataFrame(orders),
              'order_items': pd.DataFrame(order_items),
              'delivery_performance': pd.DataFrame(delivery_performance),
              'customer_feedback': pd.DataFrame(feedback),
              'marketing_performance': pd.DataFrame(marketing_data)
          }
      
      def save_blinkit_data(data_dict, prefix='blinkit_'):
          """Save all generated DataFrames to CSV files"""
          for name, df in data_dict.items():
              df.to_csv(f'{prefix}{name}.csv', index=False)
      
      # Generate and save the data with custom date range
      data = generate_complete_blinkit_data(start_date='2023-01-01', end_date='2025-12-31')
      save_blinkit_data(data)
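
The 70/20/10 delivery split in the order loop can also be written as weighted sampling, which keeps the probabilities and the delay ranges in one place. This is a minimal sketch that assumes the same status labels and minute ranges as the script; STATUSES, WEIGHTS, DELAY_MINUTES, and draw_delivery are illustrative names that do not appear in the original code.

```python
import random
from collections import Counter

random.seed(150)

STATUSES = ['On Time', 'Slightly Delayed', 'Significantly Delayed']
WEIGHTS = [0.7, 0.2, 0.1]  # same split as the if/elif thresholds (0.7, 0.9)
DELAY_MINUTES = {          # offset from the promised time, in minutes
    'On Time': (-5, 5),
    'Slightly Delayed': (6, 15),
    'Significantly Delayed': (16, 30),
}

def draw_delivery():
    """Pick a delivery status by weight, then a delay within that status's range."""
    status = random.choices(STATUSES, weights=WEIGHTS, k=1)[0]
    low, high = DELAY_MINUTES[status]
    return status, random.randint(low, high)

sample = [draw_delivery() for _ in range(10_000)]
counts = Counter(status for status, _ in sample)
print(counts)
```

Counting the statuses over a large sample is a quick way to confirm that the generated data follows the intended proportions.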
      

       

Steps to Run the Script

Step 1: Save the Code

Copy the code into a file named DataCreator.py.

Step 2: Run the Script

Execute the script using the terminal or command prompt:

    python DataCreator.py
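
If editing the file before every run becomes inconvenient, the settings could be read from the command line instead. The sketch below is an assumption rather than part of the original script: the flag names (--num-orders, --start-date, --end-date, --prefix) and the helpers iso_date and parse_args are hypothetical, and the defaults simply mirror the values used in the script.

```python
import argparse
from datetime import datetime

def iso_date(text):
    """Accept only YYYY-MM-DD strings; argparse reports a ValueError as a usage error."""
    datetime.strptime(text, '%Y-%m-%d')
    return text

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description='Generate synthetic order data.')
    parser.add_argument('--num-orders', type=int, default=10_000)
    parser.add_argument('--start-date', type=iso_date, default='2023-01-01')
    parser.add_argument('--end-date', type=iso_date, default='2025-12-31')
    parser.add_argument('--prefix', default='blinkit_')
    return parser.parse_args(argv)

# Parsing an explicit list here stands in for real command-line arguments.
args = parse_args(['--num-orders', '500', '--end-date', '2023-03-31'])
print(args.num_orders, args.start_date, args.end_date, args.prefix)

# Inside DataCreator.py the parsed values would then be passed on, for example:
# data = generate_complete_blinkit_data(args.num_orders, args.start_date, args.end_date)
# save_blinkit_data(data, prefix=args.prefix)
```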

Step 3: Check the Output

  • The script prints nothing to the terminal. With the default num_orders=10000 it generates 10,000 orders and 5,000 customers.
  • The inventory table gets one row per product per day and the marketing table gets one row per campaign per day, so both grow with the date range.
  • Each table is written to its own CSV file in the current working directory.

Files created with the default blinkit_ prefix:

    blinkit_customers.csv
    blinkit_products.csv
    blinkit_inventory.csv
    blinkit_orders.csv
    blinkit_order_items.csv
    blinkit_delivery_performance.csv
    blinkit_customer_feedback.csv
    blinkit_marketing_performance.csv
      

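Customization Options

  1. Change Order Count: Pass num_orders to control how many orders are generated (the number of customers is num_orders // 2):

    data = generate_complete_blinkit_data(num_orders=100_000)  # For 100,000 orders

     Note that delivery_partner_id uses digits=5, which allows only 100,000 distinct values. Increase its digits value before generating that many orders, or the run will end in a UniquenessException.

  2. Adjust Date Range: Update start_date and end_date in the function call:

    data = generate_complete_blinkit_data(start_date='2022-01-01', end_date='2023-12-31', num_orders=50_000)

  3. Field Ranges: Change the digits argument of fake.unique.random_number() for store_id, product_id, or customer_id to widen or narrow the range of each ID.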
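
When picking a digits value, it helps to compare the number of IDs you need with the number the field can hold. With its default arguments, random_number(digits=n) returns an integer from 0 to 10**n - 1, so the pool has 10**n values. The helpers below (unique_capacity, fill_ratio, min_digits) are hypothetical names for illustration, and the 50% fill target is an arbitrary assumption, not a Faker rule.

```python
def unique_capacity(digits):
    """Distinct values random_number(digits=digits) can return: 0 .. 10**digits - 1."""
    return 10 ** digits

def fill_ratio(digits, needed):
    """Fraction of the pool that `needed` unique IDs would use up."""
    return needed / unique_capacity(digits)

def min_digits(needed, max_fill=0.5):
    """Smallest digits value that keeps the pool at or below max_fill."""
    digits = 1
    while fill_ratio(digits, needed) > max_fill:
        digits += 1
    return digits

# delivery_partner_id uses digits=5 and needs one ID per order.
print(fill_ratio(5, 10_000))    # 0.1 -> plenty of room
print(fill_ratio(5, 100_000))   # 1.0 -> pool fully used, the last IDs are nearly impossible to draw
print(min_digits(100_000))      # 6
```

Keep in mind that fake.unique tracks values per method and argument combination, so fields generated with the same call share one pool. In this script, product_id and campaign_id both use random_number(digits=6), and order_id and store_id both use random_number(digits=10), so the needed count for each pool is the sum across those fields.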

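Troubleshooting Tips

  • UniquenessException: Faker raises this error when fake.unique cannot find an unused value within its retry limit. First widen the range of the affected field with a larger digits value. Raising the retry limit only helps when the pool is nearly, but not completely, used up. The limit is the private constant faker.proxy._UNIQUE_ATTEMPTS (1,000 by default); Faker does not read a variable with that name defined in your own script.
  • Memory Errors: The inventory table holds one row per product per day, so it dominates memory use for long date ranges. For extremely large datasets, consider generating data in chunks or writing rows to an external database as they are produced.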
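
One way to generate data in chunks is to produce rows lazily and write each one as soon as it exists, instead of collecting everything in a list first. The sketch below uses only the standard library and reuses the inventory fields from the script. The names iter_inventory_rows, write_rows, and FIELDS are illustrative, and the two product IDs are placeholder values.

```python
import csv
import os
import random
import tempfile
from datetime import date, timedelta

FIELDS = ['product_id', 'date', 'stock_received', 'damaged_stock']

def iter_inventory_rows(product_ids, start, end, seed=150):
    """Yield one inventory row per product per day without building a list."""
    rng = random.Random(seed)
    days = (end - start).days
    for product_id in product_ids:
        for offset in range(days + 1):
            yield {
                'product_id': product_id,
                'date': (start + timedelta(days=offset)).isoformat(),
                'stock_received': rng.randint(4, 20),
                'damaged_stock': rng.randint(0, 3),
            }

def write_rows(rows, path, fieldnames):
    """Stream rows to a CSV file and return how many were written."""
    count = 0
    with open(path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            count += 1
    return count

# Demo: two products over January 2023 (31 days each), written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'blinkit_inventory.csv')
    rows = iter_inventory_rows([101, 102], date(2023, 1, 1), date(2023, 1, 31))
    written = write_rows(rows, path, FIELDS)
    with open(path, newline='', encoding='utf-8') as handle:
        rows_on_disk = sum(1 for _ in csv.DictReader(handle))

print(written, rows_on_disk)
```

Because only one row is held in memory at a time, the same pattern scales to date ranges and product counts that would not fit in a single DataFrame.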

Use Cases

This script can be used for:

  • Testing database systems with large-scale data.
  • Simulating e-commerce datasets for analytics.
  • Learning data processing and visualization.

Conclusion

This Python script, powered by the Faker library, is a versatile tool for generating realistic datasets. With unique value generation and customizable ranges, it suits data professionals and enthusiasts alike.

Try it out and supercharge your data testing workflows! 🚀
