
multiple data sources

Guest
Mar 10, 2008
I'm creating a job board that pulls data from at least a dozen external sources, as well as displaying jobs posted in its own database. The external sources will either be in XML or need to be screen scraped. What's the best way to merge all of this data into one recordset query?

php/mysql
TOPICS
Server side applications
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
LEGEND
Mar 11, 2008
jsteinmann wrote:
> creating a job board that pulls data from at least a dozen external sources, as
> well as display jobs posted in the database. External sources will be either
> in xml or need to be screen scraped. What's the best way to merge all of this
> data into one recordset query?
>
> php/mysql
>

Your best bet (I think) would be to have a script scheduled to
import the data from all the sources into one table; the webpage
would then just display the data. Otherwise the page is going to take
forever to load.

How you'd do this I honestly don't know, sorry.

Steve
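The scheduled-import idea above could be sketched in PHP along these lines. The feed structure and the `<job>`/`<title>`/`<company>`/`<link>` element names are invented for illustration; every real source will differ and the screen-scraped ones would need their own parsers:

```php
<?php
// Sketch of the import step: parse one external XML feed into plain rows
// that a cron-run script can then INSERT into a single jobs table.
// The element names here are hypothetical.
function parse_job_feed($xml_string, $source)
{
    $rows = array();
    $xml = simplexml_load_string($xml_string);
    if ($xml === false) {
        return $rows; // skip a malformed feed instead of aborting the import
    }
    foreach ($xml->job as $job) {
        $rows[] = array(
            'title'   => (string) $job->title,
            'company' => (string) $job->company,
            'url'     => (string) $job->link,
            'source'  => $source,
        );
    }
    return $rows;
}

// The nightly importer would fetch each feed and insert its rows, e.g.:
// foreach ($feeds as $url) {
//     foreach (parse_job_feed(file_get_contents($url), $url) as $row) {
//         // INSERT $row into the jobs table
//     }
// }
```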
Guest
Mar 11, 2008
I can do that, but here's the catch: the external data would need to be updated every 24 hours. How would I handle that in the database?
LEGEND
Mar 11, 2008
.oO(jsteinmann)

>I can do that, but here's the catch. The external data would need to be
>updated every 24 hours.... how would I handle that in the database?

You would have to automatically fetch the external sources every 24
hours (or more often) to update your own database. On *nix systems this
can be achieved quite easily with a cronjob.

Micha
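As a concrete illustration of Micha's suggestion, a crontab entry for a nightly import might look like this (the script path and schedule are made up):

```
# Run the feed importer every night at 3:00
0 3 * * * /usr/bin/php /var/www/jobboard/import_jobs.php
```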
Guest
Mar 11, 2008
But do I delete all the old data from the database every 24 hours? I can't see how I could update it, since there wouldn't really be a way to identify which job I'm updating versus which ones are new.
LEGEND
Mar 12, 2008
jsteinmann wrote:
> but do I delete all the old data from the database every 24 hours? I can see
> how I could update it since there wouldn't really be a way to identify which
> job I'm updating, versus what ones are new....
>

That depends on what you're trying to do, what the data is, and what it's
used for. You could always timestamp each item as it comes in, but
without seeing the full details of the project it's hard to say.

Steve
Guest
Mar 12, 2008
For example, look at an RSS feed: new items get added, but the old ones are still there. If I just parse the XML and insert it into the database, I'll end up with duplicates.

Let's say I only insert items with today's date. OK, that avoids duplicates, but if the source modifies or removes something, that change would also need to be made in the database, and you can't do that unless you're attaching something that can identify the record in the database.

I'd like to know how these websites pull data from all these external sources, updating every 24 hours, yet are still able to update the information parsed from the XML in the database. Otherwise you'd have to somehow merge the XML into your query results.
LEGEND
Mar 12, 2008
.oO(jsteinmann)

>for example, look at an rss feed, new items get added, but the old ones are
>still there.

Only in your RSS aggregator, which usually keeps old entries. The feed
itself only contains the most recent items.

>If I just parse the xml and insert it into the database, i'll end
>up with duplicates.

Then you should check for dupes before you insert them into the DB.
Depending on the data, the DB might be able to handle this itself, for
example with an INSERT ... ON DUPLICATE KEY UPDATE statement.

>Let's say I only insert ones with todays date, ok that
>will work with avoiding duplicates, but if they modify or remove something,
>that would also need to be done in the database as well, and you can't do that
>unless you're attaching something that can identify it in the database.

Correct. And usually there is something that can be used to identify a
particular record: maybe the name of the source, a title, and the date
when it was first released, or something like that. That information
could be combined into an MD5 hash, for example, to get an almost
unique record identifier.

Micha
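Micha's hash idea could be sketched like this; which fields go into the hash is an assumption that depends on what the actual feeds provide:

```php
<?php
// Sketch of the almost-unique identifier Micha suggests: hash the fields
// that together identify a job. Source + title + first-seen date is one
// plausible combination; the right fields depend on the real feeds.
function job_key($source, $title, $posted_date)
{
    return md5($source . '|' . $title . '|' . $posted_date);
}

// With a UNIQUE index on jobs.job_key, the nightly import can then use:
//   INSERT INTO jobs (job_key, title, company, url)
//   VALUES (?, ?, ?, ?)
//   ON DUPLICATE KEY UPDATE title   = VALUES(title),
//                           company = VALUES(company),
//                           url     = VALUES(url);
// so a re-imported job updates the existing row instead of duplicating it.
```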