SQLsynth
This package synthesizes natural language and SQL pairs for evaluating NL2SQL models under a given database schema.
README
<h1 align="center"> SQLsynth <img width="27" height="27" alt="adobe_logo" src="https://github.com/user-attachments/assets/4e10e149-e11f-4a4a-abcd-d8b05e0eddaa" /> <img width="30" height="30" alt="acm" src="https://github.com/user-attachments/assets/a21fc439-2b15-4b55-a329-95fa8551772f" /> </h1>
[IUI 2025] Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation
This is the repo for the <img width="30" height="30" alt="iui_icon" src="https://github.com/user-attachments/assets/bc496a3c-fad1-4950-a386-79df579d6ffc" /> IUI 2025 paper
SQLsynth is not only an interactive data annotation but also automated data synthesis tool designed for quickly creating highly customized (e.g., schema, DB records, distribution) text-to-SQL datasets.
๐ Live Demo: Try me ๐๐ป
Note: Click the update button after customizing the schema
<br>
๐ Features
- Database Schema Customization
- Freely create, edit, annotate (use NL to label the semantics of database fields, useful for LLMs) in the canvas.
- ๐ฆ --> A highly customized database schema, with meaningful descriptions
- Database Records Population
- Given a database schema, populate it with concrete records
- Rule-based method (No LLM calling)
- Recognized for different datatype
- Distribution is configurable
- ๐ฆ --> A complete, customized database full of records
- SQL Query Sampling
- Given a database, randomly sample SQL queries.
- Based on PCFG (Probability Context-Free Grammar) and other rules to extract records from a specified database.
- The probability distribution is configurable (e.g., increase the number of queries with WHERE clauses or those involving a specific column).
- Syntax is customizable (e.g., support for user-defined SQL dialect).
- Filter by executing the SQL to ensure it is associated with meaningful execution results
- ๐ฆ --> A large amount of SQL queries (with a customized distribution) under the provided database
- SQL-to-Text Generation
- Convert SQL queries into NL questions
- Three stages:
- Convert the SQL query into step-by-step NL explanations by a grammar-based method.
- Conduct in-context on specified real-world data for style adaptation
- Generating the NL question by LLMs
- ๐ฆ --> A large amount of (NL, SQL) pairs under the customized database, where NL questions may be perfect (ambiguous, lack details, etc.)
- Text-SQL Alignment
- Mapping NL components (substrings) to SQL compoenents (clauses)
- Error checking for generated NL (note that the SQL is absolutely correct)
- Use to analyze (1) what information may be missing (the SQL component fails to map to NL components), and (2) what information may be redundant (the NL component doesn't map to any SQL compoenent)
- Interactively highlight by visual correspondence in the UI
- Scoring: Given a NL-SQL paier, evaluate the quality on a scale from 0 to 100, with analysis in NL
- ๐ฆ --> A large amount of refined (NL, SQL) pairs under the customized database
- Dataset Statistics & Visualization
- Upload and analyze existing SQL query datasets
- Assist users in tracking datasets from a dataset-level perspective
- Comprehensive statistics dashboard with summary metrics (total queries, unique keywords, average complexity),including:
- SQL structure distribution
- Keyword frequency distribution
- Clause number distribution
- Column and table usage patterns
- Query complexity distribution
- Reference value distribution
- ๐ฆ --> Insights into dataset characteristics and qualities
<br>
๐ฆ Installation
Backend Setup
-
Clone the repository:
git clone https://github.com/yourusername/SQLsynth.git cd SQLsynth -
Install Python dependencies:
cd backend pip install flask pip install flask_cors pip install sql-metadata pip install openai pip install nltk pip install spacy pip install sqlparse python -m spacy download en_core_web_sm -
Configure LLM API:
- Open
backend/openai_api.py - Implement your own
get_openai_response()function - The function should take a string prompt as input and return a string response
- Open
Frontend Setup
-
Install Node.js dependencies:
cd frontend npm install -
If you encounter missing dependencies, please use
npm installfor necessary packages based on pop-up instructions.
<br>
๐ Quick Start
Running the Application
-
Start the Backend Server:
cd backend python server.pyThe backend will run on
http://localhost:5001by default. -
Start the Frontend:
cd frontend npm startThe frontend will run on
http://localhost:3000by default. -
Open your browser and navigate to
http://localhost:3000 -
Enjoy! ๐
<br>
Project Structure
SQLsynth_repo/
โโโ backend/
โ โโโ server.py # Main Flask server
โ โโโ SQL_synthesizer.py # Query synthesis engine
โ โโโ SQL2NL_clean.py # Rule-based SQL parser
โ โโโ llm_analysis.py # LLM prompts and analysis
โ โโโ records_synthesizer.py # Record generation
โ โโโ ICL_retriever.py # Example retrieval
โ โโโ db_handling.py # Database utilities
โ โโโ openai_api.py # LLM API interface
โ โโโ evaluation_steps.py # Evaluation tools
โ โโโ *_config.json # Configuration files
โ โโโ output_data/ # Generated datasets
โ โโโ temp_db/ # Temporary databases
โโโ frontend/
โ โโโ src/
โ โ โโโ App.jsx # Main app component
โ โ โโโ SchemaTab.jsx # Schema designer
โ โ โโโ DatabaseTab.jsx # Record management
โ โ โโโ DatasetTab.jsx # Dataset synthesis
โ โ โโโ AnalysisTab.jsx # Analysis interface
โ โโโ public/ # Static assets
โ โโโ package.json # Dependencies
โโโ user_study/
โ โโโ spider_schemas/ # 166 Spider schemas
โโโ README.md
Backend (backend/)
server.py: Flask server handling all API endpointsSQL_synthesizer.py: PCFG-based SQL query generationSQL2NL_clean.py: Rule-based SQL decomposition and explanationllm_analysis.py: LLM prompts and analysis functionsrecords_synthesizer.py: Database record generation with constraint satisfactionICL_retriever.py: In-context learning example retrievaldb_handling.py: Database operations and utilitiesopenai_api.py: LLM API interface (user-implemented)evaluation_steps.py: Evaluation utilities
Frontend (frontend/src/)
App.jsx: Main application component with global state managementSchemaTab.jsx: Interactive schema designerDatabaseTab.jsx: Database record management interfaceDatasetTab.jsx: Dataset synthesis and downloadAnalysisTab.jsx: SQL-NL alignment analysisSQLSubexpressionCorrespondence.jsx: Visual representation of SQL components
Configuration Files
manual_config.json: Manual probability configuration for SQL synthesislearned_config.json: Learned probability distribution from existing datasetsspider_example_pool.json: Example pool for in-context learning
<br>
Usage
1. Schema Design
- Import Schema: Drag and drop a JSON schema file
- Edit Schema: Add/remove tables and columns
- Define Relationships: Specify primary and foreign keys
- Add Descriptions: Document tables and columns for better NL generation
Schema format example:
{
"users": {
"comment": "User information table",
"columns": [
{
"field": "user_id",
"type": "text",
"isPrimary": true,
"comment": "Unique user identifier"
},
{
"field": "username",
"type": "text",
"comment": "User's login name"
}
]
}
}
2. Record Synthesis
- Click "Generate Records" to create synthetic data
- Specify the number of records to generate
- Records respect foreign key constraints and data types
- Export records to JSON
3. NL-SQL pair Synthesis
- Configure query distribution (number of tables, columns, clauses)
- Generate individual queries or batch synthesis
- View step-by-step SQL decomposition
- Get suggested natural language descriptions
- Check alignment between SQL and NL
4. NL-SQL pair Scoring
- Evaluate the quality and equivalence between natural language questions and SQL queries
- Score with explanations
5. Dataset Analysis
- Upload existing SQL query datasets
- View comprehensive statistics:
- Keyword distribution
- Query structure patterns
- Clause complexity
- Column and table usage
- Query complexity metrics
Script-Based Synthesis
While human-in-the-loop guarantees the data quality, you can also opt for large-scale dataset generation without the UI:
from server import auto_synthetic_data
synthetic_data = auto_synthetic_data(
schema_path="backend/saved_frontend_schema.json",
save_path="backend/output_data/synthetic_data.jsonl",
config_path="backend/learned_config.json",
synthesized_DB_records_path="backend/output_data/DB_records.json",
example_path="backend/spider_example_pool.json",
data_num=2000
)
Parameters:
schema_path: Path to the database schema JSON filesave_path: Output file path for synthetic dataconfig_path: Configuration file for query distributionsynthesized_DB_records_path: Path to save generated database recordsexample_path: Path to example pool for in-context learningdata_num: Number of SQL-NL pairs to generate
Configuration
<details>
<summary>Click to expand configuration details</summary>
Query Distribution Configuration
Adjust probabilities in learned_config.json or manual_config.json:
{
"sample_table_probs": [0.5, 0.3, 0.2],
"sample_column_probs": [0.4, 0.3, 0.2, 0.1],
"select_star_prob": 0.2,
"where_clause_prob": 0.3,
"group_by_clause_prob": 0.2,
"order_by_clause_prob": 0.3,
"having_clause_prob": 0.3,
"limit_clause_count": 0.1
}
Network Configuration
Change Backend Port
Edit backend/server.py:
app.run(debug=True, host="0.0.0.0", port=YOUR_PORT)
Change Frontend Port
# macOS/Linux
PORT=4000 npm start
# Windows
set PORT=4000 && npm start
Deploy on Server
Replace localhost with your server IP in frontend/src/App.jsx:
const ip = 'your.server.ip'; // or domain name
const port = 5001;
</details>
API Reference
<details>
<summary>Click to expand API reference details</summary>
Key Endpoints
POST /step_by_step_description
Generate step-by-step explanation for a SQL query.
Request:
{
"sql": "SELECT name FROM users WHERE age > 18",
"schema": {...}
}
Response:
{
"explanation_data": [...]
}
POST /suggested_nl
Get suggested natural language description for SQL.
Request:
{
"sql": "...",
"schema": {...},
"parsed_step_by_step_data": [...]
}
Response:
{
"nl_query": "What are the names of users older than 18?",
"examples": [...]
}
POST /check_alignment
Check alignment between NL and SQL components.
Request:
{
"sql": "...",
"nl": "...",
"schema": {...},
"parsed_step_by_step_data": [...]
}
Response:
{
"alignment_data": [...],
"uncovered_substrings": [...]
}
POST /synthesize_records
Generate synthetic database records.
Request:
{
"schema": {...},
"num": 100
}
Response:
{
"synthetic_records": {...}
}
POST /synthetic_sql
Generate a random SQL query.
Request:
{
"schema": {...},
"records": {...}
}
Response:
{
"synthetic_sql": "SELECT ...",
"config": {...}
}
POST /analyze_dataset
Analyze an uploaded SQL query dataset.
Request: Multipart form data with file upload
Response:
{
"totalQueries": 1000,
"averageComplexity": 12.5,
"keywordDistribution": {...},
"structureDistribution": {...},
...
}
</details>
<br>
Citation
If you use SQLsynth in your research, please cite:
@inproceedings{Tian_2025, series={IUI โ25},
title={Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation},
url={http://dx.doi.org/10.1145/3708359.3712083},
DOI={10.1145/3708359.3712083},
booktitle={Proceedings of the 30th International Conference on Intelligent User Interfaces},
publisher={ACM},
author={Tian, Yuan and Lee, Daniel and Wu, Fei and Mai, Tung and Qian, Kun and Sahai, Siddhartha and Zhang, Tianyi and Li, Yunyao},
year={2025},
month=mar, pages={1398โ1425},
collection={IUI โ25} }
<br>
Acknowledgments
- This work was completed during an internship at Adobe
- Adobe Property <img width="20" height="20" alt="adobe_logo" src="https://github.com/user-attachments/assets/4e10e149-e11f-4a4a-abcd-d8b05e0eddaa" />
<br>
Contact
For questions or feedback, please open an issue on GitHub or contact me by [email protected].
