Are you data modeling for a custom OLTP application starting from scratch, with business requirements as the only initial input?
No matter what the size, domain or location of your project, the following lessons will help to ensure that your physical data modeling process is smooth and that you end up with a database design that is accurate, usable and scalable.
1. Convene, Communicate, Correlate, Consolidate
The larger the scope of the project, the bigger the data modeling team and the more the cross-functional dependencies, the greater the need to plan time for "the 4 Cs", so to speak: getting the team together, forming and communicating decisions, finding patterns across entities identified by different data modelers (and merging or splitting entities as a consequence), and consolidating work periodically. This is essential to ensure there are no gaps or redundancies, and that the model serves the business requirements as efficiently as possible.
2. Choose the "right" approach from a set of alternatives
You will often be faced with situations in which a number of design alternatives can address the same requirement. For example, maintaining the employee hierarchy in a single EMP table of the SCOTT schema variety, versus creating a separate Employee Reporting structure table to store the same data.
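To make that concrete, here is a rough sketch of the two alternatives (the column names and datatypes are illustrative, not a recommendation):

```sql
-- Alternative 1: hierarchy held on the employee row itself (SCOTT.EMP style)
CREATE TABLE emp (
  empno NUMBER(4)    PRIMARY KEY,
  ename VARCHAR2(10) NOT NULL,
  mgr   NUMBER(4)    REFERENCES emp (empno)  -- self-referencing "reports to" link
);

-- Alternative 2: hierarchy factored out into a separate reporting table,
-- which can also track reporting lines over time
CREATE TABLE employee (
  employee_id   NUMBER        PRIMARY KEY,
  employee_name VARCHAR2(100) NOT NULL
);

CREATE TABLE employee_reporting (
  employee_id NUMBER NOT NULL REFERENCES employee (employee_id),
  manager_id  NUMBER NOT NULL REFERENCES employee (employee_id),
  from_date   DATE   NOT NULL,
  to_date     DATE,
  CONSTRAINT employee_reporting_pk PRIMARY KEY (employee_id, from_date)
);
```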
How do you choose the "right" approach?
Sometimes all it takes is a quick listing of the pros and cons of each approach and a discussion across the table to pick one. For more complex requirements with a wider impact, the standard Decision Analysis and Resolution (DAR) format can be adopted. In DAR, the parameters that will drive the decision are listed, each with a weight of 1-5 (1 = least important, 5 = most important). Each approach is then rated on a scale of 1-5 (1 = least suitable, 5 = most suitable) against every parameter, and a weighted total is derived for each approach.
A sample DAR evaluation to choose between two database design solutions:
| Parameter | Weight | Approach 1 | Approach 2 |
|---|---|---|---|
| Data model simplicity | 3 | 2 | 3 |
| Ease of application use | 2 | 4 | 2 |
| Flexibility | 2 | 1 | 3 |
| Performance | 4 | 3 | 4 |
| Total | | 28 | 35 |
Approach 2 wins with the higher weighted total: (3×3) + (2×2) + (2×3) + (4×4) = 35, against Approach 1's (3×2) + (2×4) + (2×1) + (4×3) = 28.
The DAR format also serves as a formal record of why a specific approach was chosen, should the question come up later, when the team has presumably moved on to other projects.
3. Define standards before you start
…and it isn’t enough just to define them: make sure everyone on the data modeling team is fully aware of them. If a standard is found missing, encourage the team to ask, and update the standards document accordingly.
A checklist of the basic standards to set (a small DDL sketch applying a few of them follows the list):
- Should table names be singular or plural?
- Should column names include or exclude the table name?
- What should the naming convention be for different types of database objects – views, sequences, constraints, indexes, triggers, etc.?
- Should tables have natural keys or surrogate keys?
- How should columns on each table be sequenced?
- Should certain kinds of columns have standardized naming styles, widths and precision? e.g. Description fields across tables could consistently end in _DESC and have the datatype VARCHAR2(500).
- If the name of the database object exceeds Oracle’s limit of 30 characters, how should the name be abbreviated?
- Should each database object name contain a shortcode representing its schema/subject area? If yes, are these shortcodes defined?
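As promised above, here is what a table might look like once a few of these standards have been settled. The prefixes, widths and constraint-naming pattern below are hypothetical choices, not prescriptions:

```sql
-- Hypothetical standards applied: singular table names, an HR_ subject-area
-- prefix, a surrogate key named <table>_ID, _DESC columns as VARCHAR2(500),
-- and constraints named <table>_<type> (PK, UK, CK, ...).
CREATE TABLE hr_department (
  department_id   NUMBER        NOT NULL,
  department_code VARCHAR2(30)  NOT NULL,
  department_desc VARCHAR2(500),
  CONSTRAINT hr_department_pk PRIMARY KEY (department_id),
  CONSTRAINT hr_department_uk UNIQUE (department_code)
);
```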
4. Categorize tables by nature of data
Every table in a typical OLTP application falls into one of these broad categories:
- Reference: Tables that contain mostly static reference data and have few columns. This includes data such as currencies, countries, time zones, etc. Reference tables change infrequently in structure and content, and when they do change, the impact on the application can be high.
- Master: Tables that contain data with some permanence, such as customer or employee information. Such tables can have more attributes than reference tables and many dependent objects. Masters change more often than reference tables, and some masters change more frequently than others. e.g. An employee changes names rarely, gets a higher designation every few years, and may shift between projects every few months.
- Transaction: Tables associated with masters, which record events or processes involving those masters. Transaction tables generally have few other objects dependent on them, and their data changes frequently. e.g. Crediting of an employee’s salary, processing of a customer’s order.
- Staging/Interface: Tables that store intermediate data during a process flow, say while an import program moves data from a source table to a target table across different schemas or databases. The staging table provides a place where data can be validated or transformed before it is moved to its target at a later time.
- Temporary: Tables that are meant to hold data briefly for some temporary use, such as during data upload from a CSV file. Temporary tables are meant to be application helpers and have low data persistence needs — they may hold data only for the duration of a transaction or session.
The data modeling process works smoothest when tables are identified and created in order of high-to-low dependency, i.e. entities with the largest number of dependent objects are put in place first. This means reference and master tables first, followed by transaction tables, then staging/interface and temporary tables.
The challenge for the data modeler is that requirements gathering is typically structured around business flows, which map more clearly to transaction tables. Reference and master tables are seldom listed out cleanly; they have to be inferred and derived.
Invest the effort to get at the reference and master tables and model them first.
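A minimal sketch of that high-to-low ordering, using illustrative tables (one reference, one master, one transaction):

```sql
-- 1. Reference: small, static, heavily referenced
CREATE TABLE currency (
  currency_code VARCHAR2(3)  PRIMARY KEY,
  currency_name VARCHAR2(50) NOT NULL
);

-- 2. Master: longer-lived business data that depends on reference data
CREATE TABLE employee (
  employee_id     NUMBER        PRIMARY KEY,
  employee_name   VARCHAR2(100) NOT NULL,
  salary_currency VARCHAR2(3)   REFERENCES currency (currency_code)
);

-- 3. Transaction: frequently changing events recorded against masters
CREATE TABLE salary_payment (
  payment_id   NUMBER       PRIMARY KEY,
  employee_id  NUMBER       NOT NULL REFERENCES employee (employee_id),
  amount       NUMBER(12,2) NOT NULL,
  payment_date DATE         NOT NULL
);
```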
5. Identify non-functional requirements
During the requirements gathering phase, business analysts tend to focus purely on functional needs; the non-functional requirements remain unstated or assumed. If these are not unearthed before the physical database is built, there can be nasty surprises with a big impact on the database.
These questions must get addressed early in the database design process:
Which entities in the design require logical deletion?
The question is significant for reference and master tables, where physical deletion would imply erasing dependent data. Is the need to retain the data in the database even though it is no longer shown in the application, or to cascade physical deletion to the dependent data?
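One common way to support logical deletion is a soft-delete flag, sketched below on a hypothetical CUSTOMER master (the column and constraint names are illustrative):

```sql
-- Master table with a soft-delete flag; rows are marked inactive rather than removed
CREATE TABLE customer (
  customer_id   NUMBER        PRIMARY KEY,
  customer_name VARCHAR2(200) NOT NULL,
  active_flag   VARCHAR2(1)   DEFAULT 'Y' NOT NULL
    CONSTRAINT customer_active_flag_ck CHECK (active_flag IN ('Y', 'N'))
);

-- "Delete" a customer without touching its dependent transaction data
UPDATE customer
   SET active_flag = 'N'
 WHERE customer_id = 1001;
```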
What are the auditing and history requirements?
Some sort of user/date tracking is usually required for each record in the database — who created it and when, who made the most recent update and when. Is there a further level of detail expected — record version numbers, full trace of each database change, etc.?
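A typical minimal answer is a set of standard "who" columns on every table, sketched here on a hypothetical INVOICE table (the column names are one common convention, not the only one):

```sql
CREATE TABLE invoice (
  invoice_id        NUMBER       PRIMARY KEY,
  invoice_amount    NUMBER(12,2) NOT NULL,
  -- standard audit columns; richer auditing (row versions, a full change history)
  -- would need journal tables or triggers on top of these
  created_by        VARCHAR2(30) DEFAULT USER    NOT NULL,
  created_date      DATE         DEFAULT SYSDATE NOT NULL,
  last_updated_by   VARCHAR2(30) DEFAULT USER    NOT NULL,
  last_updated_date DATE         DEFAULT SYSDATE NOT NULL
);
```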
Which data needs archiving or purging?
Some data is less important to the business than other data and may require more frequent clean-up, e.g. prospective Customer data as opposed to actual Customer data. Have the expected deletion/archiving criteria been captured in the database design?
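For example, if the business agrees that lapsed prospects can be purged after a retention period, the criterion can be scripted explicitly. The sketch below assumes hypothetical CUSTOMER_TYPE, ACTIVE_FLAG and LAST_UPDATED_DATE columns on CUSTOMER, and a purely illustrative 18-month window:

```sql
-- Periodic purge of prospects that never converted, per the agreed retention rule
DELETE FROM customer
 WHERE customer_type = 'PROSPECT'
   AND active_flag   = 'N'
   AND last_updated_date < ADD_MONTHS(SYSDATE, -18);
```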
Is there any need for implicit filtering?
Should users see only the data for their own department, legal entity or location in the application? This requirement is easy to miss during development if it was mentioned only in the project scope/plan and not in the use cases handed over to the development team. Such filters are hard to retrofit if they were not modeled for in the first place, so ask the users upfront, before you build the application.
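One way to bake such a filter in from the start is to expose data through a view (or a Virtual Private Database policy) rather than the base table. The sketch below assumes a hypothetical INVOICE table with a DEPARTMENT_ID column and an application context named APP_CTX that holds the logged-in user's department:

```sql
-- Application code reads from the view, never the base table, so the
-- department filter cannot be forgotten on individual screens or reports.
CREATE OR REPLACE VIEW invoice_v AS
SELECT inv.*
  FROM invoice inv
 WHERE inv.department_id = SYS_CONTEXT('APP_CTX', 'DEPARTMENT_ID');
```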
6. Do not skimp on reviews
The cost of insufficient review of the data model is simply too high. If you misspell Invoice ID as Invioce ID on a screen input field or attach the wrong datatype to it, you can easily correct the error during the testing phase. Make an error of a similar nature in the INVOICE database table, and you will wreak havoc when you go about fixing it. All dependent components — database objects, APIs, screens and reports, etc. — will require re-validating and/or re-coding.
Defects in database objects are extremely hard to correct once applications have been developed over them. It is far wiser to spend extra effort reviewing the data model before it gets implemented than to catch bugs in it later.
In a large data modeling team, a multi-pronged review process before physical DB object creation can be effective: one reviewer checks the functionality and documentation, a second the application mapping, a third the datatypes and constraints, a fourth the naming standards. Plan for that time before physical data model implementation and you will save a lot of rework later.
7. Balance programming ease with data integrity
"Why have you scattered the data across so many tables? Why don’t you put it all in one place?" is the common refrain of application programmers who have to access data from a normalized database.
One of the side-effects of normalization is an increase in the number of joins needed to access the data, and application programmers want tables that correspond directly to screens and reports. The data modeler must make sure that data correctness and integrity are not sacrificed at the altar of programming ease. As it turns out, the supposed simplification in one place creates a mess in another; this article gives some great insights into the cost of skipping normalization steps.
Cater to programming ease by all means. Build views, APIs, and if necessary, denormalize after considering the impact – but do not build the database simply as a data store for screens. What looks like simplification today will make the database inflexible and un-queryable tomorrow.
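For instance, rather than collapsing the tables themselves, a read-only view can present the screen-shaped result set while the base tables stay normalized (the table and column names below are illustrative):

```sql
-- One "order details" view shaped for the screen; ORDERS, ORDER_LINE,
-- CUSTOMER and PRODUCT stay as separate, normalized tables underneath.
CREATE OR REPLACE VIEW order_details_v AS
SELECT ord.order_id,
       ord.order_date,
       cst.customer_name,
       prd.product_name,
       lin.quantity,
       lin.unit_price
  FROM orders     ord
  JOIN customer   cst ON cst.customer_id = ord.customer_id
  JOIN order_line lin ON lin.order_id    = ord.order_id
  JOIN product    prd ON prd.product_id  = lin.product_id;
```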
8. Add flexfields, but not too early
Flexfields (a feature made popular by Oracle EBS) are generic placeholder columns provided on a table that can store additional information for that table at a later stage. After the application has gone live, an urgent change request or bug fix can become unwieldy to implement if modifying the table structure is not viable; at that point flexfields play the role of savior. Say a column INVOICE_FORMAT has to be added to the INVOICE table for a bug fix. Instead of altering the table to add that column, an already available flexfield on the INVOICE table can be used to store the invoice format.
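A sketch of the mechanism, with illustrative column widths in the Oracle EBS style:

```sql
-- INVOICE created with spare placeholder columns reserved at design time
CREATE TABLE invoice (
  invoice_id     NUMBER       PRIMARY KEY,
  invoice_amount NUMBER(12,2) NOT NULL,
  attribute1     VARCHAR2(240),
  attribute2     VARCHAR2(240),
  attribute3     VARCHAR2(240)
);

-- Later, ATTRIBUTE1 is pressed into service to hold the invoice format without
-- altering the table; its meaning lives only in documentation and application code
UPDATE invoice
   SET attribute1 = 'PDF'
 WHERE invoice_id = 2001;
```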
So far so good, but flexfields also pose the danger of inducing laziness. During development or internal testing, when a need arises to alter a table, the developer takes the quick-fix approach and uses a flexfield instead. This has several disadvantages: many constraints cannot be built over flexfields and have to be handled by the application, and the purpose of each flexfield in use needs to be documented carefully, since the development team needs to know what the blandly-named ATTRIBUTE1 column is actually being used for.
Flexfields are not an ideal solution and should be used only as a last resort; it takes a high level of discipline not to reach for them until absolutely necessary. To root out the problem altogether, I’d say do not add flexfields at all until the very last stage of database design.
9. Maintain traceability between use cases and data model
Keep track of the mapping between business requirements and the project artefacts, including the logical and physical entities in the data model (a bare-bones example matrix follows this list). By the end of the project, you should be able to:
1. Ensure by looking at the traceability matrix that each business requirement has been addressed in the database design as required.
2. Analyze the impact of any change in the database on other artefacts – screens, reports, APIs, etc.
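A bare-bones traceability matrix might look something like this (the rows are purely illustrative):

| Business requirement / use case | Data model entities | Other artefacts |
|---|---|---|
| UC-12: Customer places an order | ORDERS, ORDER_LINE, CUSTOMER | Order Entry screen, create_order API |
| UC-27: Credit employee salary | SALARY_PAYMENT, EMPLOYEE | Payroll run report, process_payroll API |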
10. Follow best practices, but do not let them stand in for reasoned thinking
As they say, last but not least: pause to think whether your data modeling standards fit *this* specific case. Questioning them will lead you to identify exception scenarios, or even reveal that a standard was not so well-chosen to begin with and needs modification.
___
Any essential data modeling tips of your own to add?