Showing posts with label advice. Show all posts
Showing posts with label advice. Show all posts

Wednesday, March 21, 2012

How to find transposed data and near misses

I would like some advice on a data and query problem I face. I have a
data table with a "raw key" value which is not guaranteed to be valid
at its source. Normally, this value will be 9 numeric digits and map to
a "names" table where the entity is given assigned an "official name".

My problem is that I'd like to be able to identify data values that are
"close" to being "correct". For example, in the case of a
nine digit number such as 077467881, I'd like to be able to identify
rows with values close to this raw string. That is, if
there were a row with a value for this column that was "off" by say, a
transposed single digit (such as 077647881 in this example)
I would like to find a query to locate the "close candidates" in a
result set. If I can find rows having a raw key
value that is close to a "good key" then I can allow my user to use
other criteria to possibly assign the "close key" as
an alternate or alias of the official key. Here is part of my schema:

CREATE TABLE MYData (
StateCD char (2) NOT NULL ,
CountyCD char (3) NOT NULL ,
MYID int NULL ,
RawNumString varchar(9) NULL ,
SaleMnYear datetime NOT NULL ,
NumberWidgets int NOT NULL ,
)

CREATE TABLE MYNames (
MYID int IDENTITY (1, 1) NOT NULL ,
OfficialName varchar (70) NOT NULL ,
CONSTRAINT PK_MYNames PRIMARY KEY CLUSTERED
(
MYID
)
)
CREATE TABLE MYAltID (
RawNumString varchar (9) NOT NULL ,
MYID int NOT NULL ,
CONSTRAINT PK_MYALTID PRIMARY KEY CLUSTERED
(
RawNumString
) ,
CONSTRAINT FK_HasName FOREIGN KEY
(
MYID
) REFERENCES MYNames (
MYID
)
)
So, how to generalize something like:
SELECT * FROM MYData WHERE RawNumString = '077467881'
OR RawNumString = '077647881'For what it's worth...

The LIKE operator can perform several forms of wildcard comparisons against
2 strings. For example:

if '90120' like '9_120' print 'Yes' else print 'No'
if '90120' like '9012[0..9]' print 'Yes' else print 'No'
if '90120' like '*0120' print 'Yes' else print 'No'

Yes
Yes
Yes

The SoundEx function returns a checksum for a character string, but not
numbers. It basically disregards vowels and double letters and returns a 4
char result. For example:

print soundex('Robert')
print soundex('Roberto')
print soundex('Rabertie')
print soundex('Rabbit')
print soundex('Rob')

R163
R163
R163
R130
R100

These can be included in a where clause. For example:
SELECT * FROM MYData WHERE RawNumString like '*7746*'
SELECT * FROM MYData WHERE SoundEx(RawName) = SoundEx('Francesco')

Keep in mind that performing like or soundex comparisons do not take
advantage of indexes, so performance could be a problem on a large table.

"JJA" <johna@.cbmiweb.com> wrote in message
news:1117641848.807351.148700@.g47g2000cwa.googlegr oups.com...
> I would like some advice on a data and query problem I face. I have a
> data table with a "raw key" value which is not guaranteed to be valid
> at its source. Normally, this value will be 9 numeric digits and map to
> a "names" table where the entity is given assigned an "official name".
> My problem is that I'd like to be able to identify data values that are
> "close" to being "correct". For example, in the case of a
> nine digit number such as 077467881, I'd like to be able to identify
> rows with values close to this raw string. That is, if
> there were a row with a value for this column that was "off" by say, a
> transposed single digit (such as 077647881 in this example)
> I would like to find a query to locate the "close candidates" in a
> result set. If I can find rows having a raw key
> value that is close to a "good key" then I can allow my user to use
> other criteria to possibly assign the "close key" as
> an alternate or alias of the official key. Here is part of my schema:
> CREATE TABLE MYData (
> StateCD char (2) NOT NULL ,
> CountyCD char (3) NOT NULL ,
> MYID int NULL ,
> RawNumString varchar(9) NULL ,
> SaleMnYear datetime NOT NULL ,
> NumberWidgets int NOT NULL ,
> )
> CREATE TABLE MYNames (
> MYID int IDENTITY (1, 1) NOT NULL ,
> OfficialName varchar (70) NOT NULL ,
> CONSTRAINT PK_MYNames PRIMARY KEY CLUSTERED
> (
> MYID
> )
> )
> CREATE TABLE MYAltID (
> RawNumString varchar (9) NOT NULL ,
> MYID int NOT NULL ,
> CONSTRAINT PK_MYALTID PRIMARY KEY CLUSTERED
> (
> RawNumString
> ) ,
> CONSTRAINT FK_HasName FOREIGN KEY
> (
> MYID
> ) REFERENCES MYNames (
> MYID
> )
> )
> So, how to generalize something like:
> SELECT * FROM MYData WHERE RawNumString = '077467881'
> OR RawNumString = '077647881'|||[posted and mailed, please reply in ews]

JJA (johna@.cbmiweb.com) writes:
> I would like some advice on a data and query problem I face. I have a
> data table with a "raw key" value which is not guaranteed to be valid
> at its source. Normally, this value will be 9 numeric digits and map to
> a "names" table where the entity is given assigned an "official name".
> My problem is that I'd like to be able to identify data values that are
> "close" to being "correct". For example, in the case of a
> nine digit number such as 077467881, I'd like to be able to identify
> rows with values close to this raw string. That is, if
> there were a row with a value for this column that was "off" by say, a
> transposed single digit (such as 077647881 in this example)
> I would like to find a query to locate the "close candidates" in a
> result set. If I can find rows having a raw key
> value that is close to a "good key" then I can allow my user to use
> other criteria to possibly assign the "close key" as
> an alternate or alias of the official key. Here is part of my schema:

Fuzzy logic is not for the faint of heart, and it's definitely not my
area of expertise.

Assuming that you always have nine digits, one approach is compare
character by character and if 7 or more match, count this as a possible
match:

SELECT *
FROM tbl
WHERE CASE WHEN substring(col, 1, 1) = substring(@.val, 1, 1)
THEN 1 ELSE 0
END +
CASE WHEN substring(col, 2, 1) = substring(@.val, 2, 1)
THEN 1 ELSE 0
END +
...
CASE WHEN substring(col, 9, 1) = substring(@.val, 9, 1)
THEN 1 ELSE 0
END >= 7

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Have you ever worked with check digits before? They can prevent errors
in data entry instead of trying to patch them after the fact. The idea
of keeping an invalid key does not sound like a good design.|||Yes, I know this is not good design but we are getting a raw data file
from another organization and we have no control over their practices.
Most occurrences of this number are "valid" but it is clear from
looking at the data that there is no validation at the source. The
nature of the data is such that if we can identify 7 or 8 bytes of data
as being the same as another 9 byte and valid "key", we could assume
the key could be improved to point at the same 9 byte valid entity. So,
I thought I'd run this notion past the world of experts for some ideas.|||Thanks very much for this neat suggestion. It is exactly what I hoped
for and I can implement this a stored procedure with a couple of
parameters. I will provide a little interface where the analyst can
launch the sproc and see if there are any "near-misses". Very cool
application of the CASE facility. Thanks again.|||JJA (johna@.cbmiweb.com) writes:
> Thanks very much for this neat suggestion. It is exactly what I hoped
> for and I can implement this a stored procedure with a couple of
> parameters. I will provide a little interface where the analyst can
> launch the sproc and see if there are any "near-misses". Very cool
> application of the CASE facility. Thanks again.

Glad to hear that the idea was useful to use. Whether it suffices remains
to see. As I said that fuzzy-logic stuff is horrible.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

How to find transposed data and near misses

I would like some advice on a data and query problem I face. I have a
data table with a "raw key" value which is not guaranteed to be valid
at its source. Normally, this value will be 9 numeric digits and map to
a "names" table where the entity is given assigned an "official name".
My problem is that I'd like to be able to identify data values that are
"close" to being "correct". For example, in the case of a
nine digit number such as 077467881, I'd like to be able to identify
rows with values close to this raw string. That is, if
there were a row with a value for this column that was "off" by say, a
transposed single digit (such as 077647881 in this example)
I would like to find a query to locate the "close candidates" in a
result set. If I can find rows having a raw key
value that is close to a "good key" then I can allow my user to use
other criteria to possibly assign the "close key" as
an alternate or alias of the official key. Here is part of my schema:
CREATE TABLE MYData (
StateCD char (2) NOT NULL ,
CountyCD char (3) NOT NULL ,
MYID int NULL ,
RawNumString varchar(9) NULL ,
SaleMnYear datetime NOT NULL ,
NumberWidgets int NOT NULL ,
)
CREATE TABLE MYNames (
MYID int IDENTITY (1, 1) NOT NULL ,
OfficialName varchar (70) NOT NULL ,
CONSTRAINT PK_MYNames PRIMARY KEY CLUSTERED
(
MYID
)
)
CREATE TABLE MYAltID (
RawNumString varchar (9) NOT NULL ,
MYID int NOT NULL ,
CONSTRAINT PK_MYALTID PRIMARY KEY CLUSTERED
(
RawNumString
) ,
CONSTRAINT FK_HasName FOREIGN KEY
(
MYID
) REFERENCES MYNames (
MYID
)
)
So, how to generalize something like:
SELECT * FROM MYData WHERE RawNumString = '077467881'
OR RawNumString = '077647881'For what it's worth...
The LIKE operator can perform several forms of wildcard comparisons against
2 strings. For example:
if '90120' like '9_120' print 'Yes' else print 'No'
if '90120' like '9012[0..9]' print 'Yes' else print 'No'
if '90120' like '*0120' print 'Yes' else print 'No'
Yes
Yes
Yes
The SoundEx function returns a checksum for a character string, but not
numbers. It basically disregards vowels and double letters and returns a 4
char result. For example:
print soundex('Robert')
print soundex('Roberto')
print soundex('Rabertie')
print soundex('Rabbit')
print soundex('Rob')
R163
R163
R163
R130
R100
These can be included in a where clause. For example:
SELECT * FROM MYData WHERE RawNumString like '*7746*'
SELECT * FROM MYData WHERE SoundEx(RawName) = SoundEx('Francesco')
Keep in mind that performing like or soundex comparisons do not take
advantage of indexes, so performance could be a problem on a large table.
"JJA" <johna@.cbmiweb.com> wrote in message
news:1117641848.807351.148700@.g47g2000cwa.googlegroups.com...
> I would like some advice on a data and query problem I face. I have a
> data table with a "raw key" value which is not guaranteed to be valid
> at its source. Normally, this value will be 9 numeric digits and map to
> a "names" table where the entity is given assigned an "official name".
> My problem is that I'd like to be able to identify data values that are
> "close" to being "correct". For example, in the case of a
> nine digit number such as 077467881, I'd like to be able to identify
> rows with values close to this raw string. That is, if
> there were a row with a value for this column that was "off" by say, a
> transposed single digit (such as 077647881 in this example)
> I would like to find a query to locate the "close candidates" in a
> result set. If I can find rows having a raw key
> value that is close to a "good key" then I can allow my user to use
> other criteria to possibly assign the "close key" as
> an alternate or alias of the official key. Here is part of my schema:
> CREATE TABLE MYData (
> StateCD char (2) NOT NULL ,
> CountyCD char (3) NOT NULL ,
> MYID int NULL ,
> RawNumString varchar(9) NULL ,
> SaleMnYear datetime NOT NULL ,
> NumberWidgets int NOT NULL ,
> )
> CREATE TABLE MYNames (
> MYID int IDENTITY (1, 1) NOT NULL ,
> OfficialName varchar (70) NOT NULL ,
> CONSTRAINT PK_MYNames PRIMARY KEY CLUSTERED
> (
> MYID
> )
> )
> CREATE TABLE MYAltID (
> RawNumString varchar (9) NOT NULL ,
> MYID int NOT NULL ,
> CONSTRAINT PK_MYALTID PRIMARY KEY CLUSTERED
> (
> RawNumString
> ) ,
> CONSTRAINT FK_HasName FOREIGN KEY
> (
> MYID
> ) REFERENCES MYNames (
> MYID
> )
> )
> So, how to generalize something like:
> SELECT * FROM MYData WHERE RawNumString = '077467881'
> OR RawNumString = '077647881'
>|||[posted and mailed, please reply in ews]
JJA (johna@.cbmiweb.com) writes:
> I would like some advice on a data and query problem I face. I have a
> data table with a "raw key" value which is not guaranteed to be valid
> at its source. Normally, this value will be 9 numeric digits and map to
> a "names" table where the entity is given assigned an "official name".
> My problem is that I'd like to be able to identify data values that are
> "close" to being "correct". For example, in the case of a
> nine digit number such as 077467881, I'd like to be able to identify
> rows with values close to this raw string. That is, if
> there were a row with a value for this column that was "off" by say, a
> transposed single digit (such as 077647881 in this example)
> I would like to find a query to locate the "close candidates" in a
> result set. If I can find rows having a raw key
> value that is close to a "good key" then I can allow my user to use
> other criteria to possibly assign the "close key" as
> an alternate or alias of the official key. Here is part of my schema:
Fuzzy logic is not for the faint of heart, and it's definitely not my
area of expertise.
Assuming that you always have nine digits, one approach is compare
character by character and if 7 or more match, count this as a possible
match:
SELECT *
FROM tbl
WHERE CASE WHEN substring(col, 1, 1) = substring(@.val, 1, 1)
THEN 1 ELSE 0
END +
CASE WHEN substring(col, 2, 1) = substring(@.val, 2, 1)
THEN 1 ELSE 0
END +
..
CASE WHEN substring(col, 9, 1) = substring(@.val, 9, 1)
THEN 1 ELSE 0
END >= 7
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp|||Have you ever worked with check digits before? They can prevent errors
in data entry instead of trying to patch them after the fact. The idea
of keeping an invalid key does not sound like a good design.|||Yes, I know this is not good design but we are getting a raw data file
from another organization and we have no control over their practices.
Most occurrences of this number are "valid" but it is clear from
looking at the data that there is no validation at the source. The
nature of the data is such that if we can identify 7 or 8 bytes of data
as being the same as another 9 byte and valid "key", we could assume
the key could be improved to point at the same 9 byte valid entity. So,
I thought I'd run this notion past the world of experts for some ideas.|||Thanks very much for this neat suggestion. It is exactly what I hoped
for and I can implement this a stored procedure with a couple of
parameters. I will provide a little interface where the analyst can
launch the sproc and see if there are any "near-misses". Very
application of the CASE facility. Thanks again.|||JJA (johna@.cbmiweb.com) writes:
> Thanks very much for this neat suggestion. It is exactly what I hoped
> for and I can implement this a stored procedure with a couple of
> parameters. I will provide a little interface where the analyst can
> launch the sproc and see if there are any "near-misses". Very
> application of the CASE facility. Thanks again.
Glad to hear that the idea was useful to use. Whether it suffices remains
to see. As I said that fuzzy-logic stuff is horrible.
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

how to find the unique key and foreign key reference given the table name in MS SQL

hii...
i m trying to find the Unique key and the foreign key reference in MS-SQL.
plz any one advice me how to do this.
thank u.
radhamohan

Quote:

Originally Posted by radhamohan

hii...
i m trying to find the Unique key and the foreign key reference in MS-SQL.
plz any one advice me how to do this.
thank u.
radhamohan


Hi. You will need to explain this in much more detail. Not sure what you mean by 'reference'sql