What is NoORM: advantages and disadvantages
NoORM (No Object-Relational Mapping) is an approach to working with databases that rejects the use of traditional ORM (Object-Relational Mapping) frameworks. Instead, developers interact directly with the database using native SQL queries or other specialized data manipulation techniques.
Advantages of NoORM:
1. Query Optimization: Because developers write SQL queries by hand, they can optimize them down to the last detail, often resulting in significant performance improvements over ORM-generated queries.
2. Minimal overhead: An ORM adds extra layers of abstraction that can slow down operations; NoORM eliminates these layers, which can further improve performance.
3. Support for complex data structures: NoORM allows you to work with non-standard data structures and relationships that may be difficult to implement through ORM.
4. Process Understanding: Developers have a thorough understanding of how data is accessed and modified, making debugging and optimization easier.
Disadvantages of NoORM:
1. Code Maintenance: Changing the database schema can require updating a lot of code, making the system difficult to maintain and develop.
2. Reduced portability: Code written for one DBMS may require significant changes to work with another DBMS, which reduces the portability of the application.
3. Repetitive code: Without an ORM, developers may find themselves writing the same type of database operations over and over again, which increases code size and reduces readability.
4. Risk of SQL Injection: When writing manual SQL queries, there is a higher risk of errors leading to vulnerabilities such as SQL injection. Developers must be especially careful about validating and escaping input data.
Thus, NoORM is a powerful approach for those who want complete control over database interactions and maximum application performance. However, it demands a greater level of knowledge and care from developers.
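As a minimal sketch of the hand-written-SQL style (using Python's built-in sqlite3 module; the table and data here are invented for the example), parameterized queries keep full control over the SQL while still protecting against the injection risk mentioned above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Hand-written SQL gives full control over the exact query...
conn.execute("INSERT INTO users (name) VALUES (?)", ("Alice",))

# ...and the `?` placeholder lets the driver escape input safely,
# so even a malicious value is stored as plain data, not executed.
evil = "'); DROP TABLE users; --"
conn.execute("INSERT INTO users (name) VALUES (?)", (evil,))

rows = conn.execute("SELECT name FROM users ORDER BY id").fetchall()
print(rows)  # [('Alice',), ("'); DROP TABLE users; --",)]
```

String-concatenating the values instead of using placeholders is exactly what opens the door to SQL injection.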
6/12/2024, 3:59:01 PM
Platform Extension Framework (PXF): advantages and disadvantages
The Platform Extension Framework (PXF) is a powerful tool provided by many modern platforms to extend their functionality. PXF allows developers to create plugins and add-ons that integrate into the core platform, providing system flexibility and extensibility.
Its advantages include:
1. A time-tested, open-source solution that can be modified to suit your needs.
2. Modularity: PXF allows functionality to be divided into independent modules. This makes it easier to develop, test, and maintain code.
3. Extensibility: With PXF, you can easily add new capabilities or integrate with external services and tools, allowing the platform to evolve with your business needs.
4. Speed up development: PXF-enabled platforms often provide ready-made tools and APIs that speed up the development process and make it easier to implement new features.
5. A set of out-of-the-box connectors to popular data sources (the Hadoop stack, JDBC-accessible sources, cloud storage).
But there are also a number of disadvantages:
1. The need to maintain a separate solution with its own technology stack.
2. Its resources are, as a rule, allocated on the same servers where the DBMS itself is deployed.
3. The same data is transformed and transferred multiple times on its way from its representation in the DBMS to the types PXF itself operates on.
4. Security: Since extensions may have access to sensitive data and platform functions, it is important to ensure their security and prevent possible vulnerabilities.
5. Compatibility: Platform updates may introduce compatibility issues with existing extensions, requiring additional testing and adaptation.
The Platform Extension Framework provides powerful capabilities for extending and adapting platforms, allowing developers to create custom solutions and improve system functionality. However, it is important to consider the potential challenges and risks associated with integrating and supporting extensions in order to maximize the potential of PXF.
6/10/2024, 3:59:01 PM
Dataset of interactions with ChatGPT
is a dataset of 1 million real user interactions with ChatGPT, characterized by a wide range of languages and a variety of prompts.
It was collected by offering everyone free access to ChatGPT and GPT-4 in exchange for their chat histories.
Using this dataset, the developers created WildLlama-7b-user-assistant, a Llama-2-based bot capable of predicting both the user's prompts and the responses that ChatGPT might give.
The following script can also be used to load a dataset:
from datasets import load_dataset
dataset = load_dataset("allenai/WildChat-1M")
6/7/2024, 3:59:01 PM
TOP DS-events all over the world in June
Jun 2-4 - AI Con USA 2024 - Las Vegas, USA
Jun 3-4 - Institute for Data Science and Artificial Intelligence Conference 2024 - Manchester, UK
Jun 5 - Digital Transformation Summit - Riyadh, Saudi Arabia
Jun 5-6 - AI & Big Data Expo North America 2024 - Santa Clara, USA
Jun 5-6 - Big Data and Analytics Summit - Ontario, Canada
Jun 12-13 - The AI Summit - London, United Kingdom
Jun 17-19 - Data Science & Statistics - Amsterdam, Netherlands
Jun 18 - The Martech Summit - Jakarta, Indonesia
Jun 20 - Data Architecture Melbourne - Melbourne, Australia
Jun 25 - Data Architecture Sydney - Sydney, Australia
Jun 25-28 - MLCon Munich - Munich, Germany / Online
Jun 26-28 - International Conference on Distributed Computing and Artificial Intelligence (DCAI) - Salamanca, Spain
5/31/2024, 3:59:01 PM
Useful GitHub repositories for mastering data engineering, for beginners and beyond
- Contains a list of tools, frameworks and libraries for data engineering, making it a great starting point for those looking to dive into the field.
is a comprehensive course that provides hands-on experience in data engineering.
is a collection of articles and tutorials covering various aspects of data engineering, including data entry, data processing, and data warehousing.
is a list of open source data engineering tools that will be a goldmine for anyone who wants to contribute or use them to create real data engineering projects. It contains a wealth of information about open source tools and frameworks, making it an excellent resource for those who want to explore alternative data engineering solutions.
is a comprehensive collection of resources covering all aspects of data engineering. It includes tutorials, articles and books on all topics related to data engineering. Whether you're looking for a quick reference guide or in-depth knowledge, this reference book has something for every level of data engineer.
is a community-created wiki that provides a comprehensive resource for learning data engineering. This repository covers a wide range of topics including data pipelines, data warehouses, and data modeling.
- Offers a practical approach to learning data engineering. It features practical projects and exercises that will help you apply your knowledge and skills to real-life scenarios. By completing these projects, you will gain hands-on experience and build a portfolio that demonstrates your data engineering capabilities.
5/30/2024, 3:59:01 PM
A small selection of lesser-known but useful libraries for data analysis
- provides a spreadsheet user interface for Python. Use Pandas, create charts, import Excel sheets, analyze data and create reports.
- converts Python programs and data into WebAssembly and runs them 3x faster.
- is a Python library that uses LLM for data cleaning tasks such as categorization, transformation, and extraction.
5/27/2024, 3:59:01 PM
Large Feedback Dataset
is a large multimodal feedback dataset, built using open-source models to provide high-quality feedback.
The RLAIF-V-Dataset introduces a novel approach: open-source MLLMs provide high-quality feedback on deconfounded model responses. By training on this data, models can reach a higher level of trustworthiness than existing open-source models.
Load a dataset using a Python script as follows:
from datasets import load_dataset
dataset = load_dataset("HaoyeZhang/RLAIF-V-Dataset")
5/24/2024, 3:59:04 PM
An assistant for interacting with any kind of data
is a fully customizable personal assistant for querying and interacting with your data, locally or deployed via the cloud.
It can also answer questions related to your documents and retrieve information from existing knowledge bases.
Verba perfectly combines state-of-the-art RAG technology with Weaviate's context-aware database.
5/22/2024, 3:59:04 PM
TOP DS-events all over the world in May
May 4 - SQL Saturday - Jacksonville, USA
May 7-9 - Real-Time Analytics Summit - San Jose, USA
May 8 - Data Connect West - Portland, USA
May 8-9 - UNLEASH THE POWER OF YOUR DATA - Boston, USA
May 8-9 - Data Innovation Summit - Dubai, UAE
May 9 - Conversational AI Innovation Summit - San Francisco, USA
May 15-17 - World Data Summit - Amsterdam, The Netherlands
May 16 - Spatial Data Science Conference 2024 - London, UK
May 18 - DSF MAYDAY - London, UK
May 21 - Deployment, Utilization & Optimization of Enterprise Generative AI - Silicon Valley, USA
May 23-24 - The Data Science Conference - Chicago, USA
4/30/2024, 3:58:58 PM
Selection of ETL services for Big Data
- A cloud solution that integrates 28 enterprise data sources with popular data warehouses like Snowflake and BigQuery. The service lets a team of engineers and analysts integrate third-party tools and create data pipelines in a couple of minutes without code. For example, you can set up Facebook Ads integration with BigQuery in four clicks; there is no need to involve developers to work with Renta Marketing ETL.
- Cloud-based software that allows users to quickly and easily create pipelines. The platform supports more than 90 sources. Fivetran provides a set of ready-made integrations, so even novice developers can get to grips with the service.
- the service provides users with more than 150 ready-made integrations. You can set up integrations in three simple steps. The result is a pipeline that copies data to storage and requires no maintenance.
is a low-code application for creating pipelines. With Matillion, teams can create pipelines and automate data processing. The service has a simple interface, so even users far from programming can create and modify data. Matillion supports real-time processing. The tool supports popular data sources and makes it easy to identify and resolve data problems.
is an ETL solution designed for small businesses and marketers who primarily use Facebook Ads, Google Ads and Google Analytics. The tool has a built-in application on the Google Cloud Platform that allows you to export data directly to Google BigQuery.
4/29/2024, 3:58:58 PM
Selection of tools for working with Big Data
- Sits as a layer on top of multiple data sources, allowing users to query a wide range of information in a variety of formats, from Hadoop sequence files and server logs to NoSQL databases and cloud-based object stores.
Druid is a real-time analytics database that provides low query latency, high concurrency, multi-tenant capabilities, and instant visibility into streaming data. According to its proponents, multiple end users can simultaneously query data stored in Druid without any performance impact.
is a big data platform developed by LexisNexis and open sourced in 2011. In accordance with its full name - High-Performance Computing Cluster - the technology is essentially a cluster of computers created on the basis of standard hardware for processing, managing and delivering big data.
is an open table format used to manage data in lakes, achieved in part by tracking individual files of information in tables rather than directories. Created by Netflix for use with its petabyte-sized tables, Iceberg is now an Apache project. Iceberg is typically "used in production, where a single table can contain tens of petabytes of data."
is a distributed data storage and analytics platform for big data. It provides an online analytical processing (OLAP) engine designed to work with very large data sets. Because Kylin is built on top of other Apache technologies, including Hadoop, Hive, Parquet and Spark, its proponents say it can easily scale to handle large volumes of data.
is a distributed stream processing system created by LinkedIn and is currently an open source project managed by Apache. The system can run on top of Hadoop YARN or Kubernetes, and a standalone deployment option is also offered. According to the developers, Samza can process "several terabytes" of data state information with low latency and high throughput for fast analysis.
4/26/2024, 3:58:58 PM
Dataset catalog for object detection and segmentation
SAM + Optical Flow = FlowSAM
is a new tool for detecting and segmenting moving objects in video that significantly outperforms all previous models, for both single and multiple objects.
To train the model, many datasets were used, which became available for download at this
4/24/2024, 3:59:01 PM
Dataset for detecting problems in code
is a dataset designed to provide a diverse set of codebase issues that can be verified using unit tests in the repositories. The full SWE-bench split includes 2,294 issue-commit pairs across 12 Python repositories.
Thus, the dataset offers a new task: solving problems in the presence of a complete repository and an issue on GitHub.
To load a dataset using a Python script, you can use the following command:
from datasets import load_dataset
dataset = load_dataset("princeton-nlp/SWE-bench")
4/22/2024, 3:59:01 PM
SQL vs NoSQL: advantages and disadvantages
SQL and NoSQL are the two main approaches to data storage and processing. Each has its own advantages and disadvantages, and the choice between them depends on the specific needs of the project. Let's look at the main differences between them.
SQL (Structured Query Language) databases are relational databases, as well as Hadoop-ecosystem DBMSs, that store data in structured tables.
Benefits of SQL:
1. Data structure: Tables, relationships, and schemas make data easy to understand and manage.
2. ACID Consistency: SQL databases ensure compliance with ACID principles (atomicity, consistency, isolation, durability), ensuring transaction reliability.
3. Universal Query Language: SQL provides a rich and versatile set of tools for complex queries and analytics; individual DBMSs deviate from it only slightly, in the form of SQL dialects.
Disadvantages of SQL:
1. Horizontal scaling: Traditional SQL databases often face scaling limitations when dealing with large volumes of data.
2. Schema Complexity: Changing the data schema can be a complex and costly process.
3. Limited flexibility: Some SQL databases may have restrictions on data types or structures that may not be suitable for some types of data.
NoSQL databases, on the other hand, do not use traditional tables but instead offer flexible data models.
Advantages of NoSQL:
1. Data structure flexibility: NoSQL databases can easily scale and change without the need to rebuild the schema.
2. Horizontal scaling: Many NoSQL databases easily scale horizontally, providing high performance for large volumes of data.
3. Support for unstructured data: NoSQL databases are well suited for storing and processing unstructured data such as text, images and videos.
Disadvantages of NoSQL:
1. Lack of ACID support: Many NoSQL databases sacrifice ACID consistency for performance or flexibility.
2. Consistency Difficulty: When scaling and distributing data, NoSQL databases can face challenges maintaining data consistency.
Thus, depending on the project requirements and priorities for performance, scalability and flexibility, the choice between SQL and NoSQL databases may be different.
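To make the contrast concrete, here is a small sketch (SQLite stands in for the relational side and a plain JSON document for the schemaless side; the record shape is invented for the example):

```python
import json
import sqlite3

# Relational: a fixed schema, enforced by the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES (?, ?, ?)", (1, "laptop", 999.0))
row = conn.execute("SELECT name, price FROM products").fetchone()

# Document-style: the record is just nested data; adding a field needs
# no schema migration, which is the flexibility NoSQL stores offer.
doc = {"id": 1, "name": "laptop", "price": 999.0,
       "tags": ["electronics", "sale"]}  # extra field, no ALTER TABLE
serialized = json.dumps(doc)

print(row)                             # ('laptop', 999.0)
print(json.loads(serialized)["tags"])  # ['electronics', 'sale']
```

The relational insert would reject a row that does not match the declared columns, while the document side happily grows new fields, trading that flexibility for weaker guarantees.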
4/19/2024, 3:59:01 PM
Where can I get the data? Multiple open repositories
- GitHub repository with a list of open datasets and direct download links. There is video, image, and audio data, and much more besides.
is a source that includes 20k+ datasets. There are also libraries for Python and R.
- data warehouse from AWS. There are some datasets here that you won't find anywhere else.
- collections of datasets that were used in real studies
- a repository in which Datasets are conveniently divided by application areas (NLP, CV, etc.)
4/17/2024, 3:59:01 PM
A large speech-detection dataset with more than 150 thousand hours of audio in 6,000+ languages has been published
It contains about 150 thousand hours of audio in more than 6,000 languages. The number of unique ISO codes in the dataset does not match the actual number of languages, since similar languages can share the same code.
The data was labeled for the voice-detection task at a time step of approximately 30 milliseconds (512 samples at a 16 kHz sampling rate).
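The frame arithmetic can be checked in a couple of lines (a quick sketch; the 512-sample and 16 kHz figures come from the text above, and 512 samples works out to 32 ms, i.e. roughly the stated ~30 ms step):

```python
# Convert a frame length in samples to milliseconds.
def frame_ms(n_samples: int, sample_rate_hz: int) -> float:
    return n_samples / sample_rate_hz * 1000

# 512 samples at 16 kHz -> 32.0 ms, close to the ~30 ms step in the text
print(frame_ms(512, 16_000))  # 32.0
```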
4/15/2024, 3:59:01 PM
Data used when training the MA-LMM model
MA-LMM (Memory-Augmented Large Multimodal Model) is a large memory-augmented multimodal model for understanding the context of long videos.
The model allows the use of long context by significantly reducing GPU memory usage. Instead of trying to process more frames at once, like most existing models, MA-LMM processes video online while storing past information in a memory bank.
The data on which the model was trained has been made publicly available. It consists of two very large datasets that can be downloaded from this
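The memory-bank idea can be sketched in a few lines. This is an illustration of the general principle only, not the authors' implementation: past frame features go into a fixed-capacity bank, and when it overflows, the two most similar adjacent entries are merged, so memory stays bounded while older context is compressed rather than dropped.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

class MemoryBank:
    """Fixed-capacity store of frame features (illustrative sketch)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = []

    def add(self, feature):
        self.frames.append(feature)
        if len(self.frames) > self.capacity:
            # Find the most similar pair of adjacent frames...
            i = max(range(len(self.frames) - 1),
                    key=lambda k: cosine(self.frames[k], self.frames[k + 1]))
            # ...and merge them by averaging, keeping the bank bounded.
            merged = [(x + y) / 2
                      for x, y in zip(self.frames[i], self.frames[i + 1])]
            self.frames[i:i + 2] = [merged]

bank = MemoryBank(capacity=4)
for t in range(10):
    bank.add([float(t), 1.0])  # stand-in for a frame's feature vector
print(len(bank.frames))  # 4
```

Processing a 10-frame "video" this way never holds more than 4 entries in memory, which is the property that lets the real model handle long videos without GPU memory growing with video length.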
4/12/2024, 3:59:01 PM
Open Source Synthetic Text-to-SQL Dataset
Gretel releases largest open source dataset to speed up training of AI models
As of April 2024, the developers believe it to be the largest and most diverse synthetic text-to-SQL dataset available.
The dataset contains approximately 23 million tokens, including approximately 12 million SQL tokens, and a wide range of SQL complexity levels, including subqueries, single joins, multiple joins, aggregations, window functions, and set operations.
To load a dataset via the Python API, you need to write the following script:
from datasets import load_dataset
dataset = load_dataset("gretelai/synthetic_text_to_sql")
4/10/2024, 3:59:01 PM